Teknikal's_Domain

#<NTA:NnT:SSrgS:H6.6-198:W200-90.72:CBWg>

Graylog, and the Syslog Protocol, Explained

2021-02-19 9 min read Tech explained Teknikal_Domain Unable to load comment count

Table of Contents

So if you’ve tried enterprise log management systems, you’ve likely heard of Syslog. If you haven’t, Syslog, is, well, a protocol designed to allow multiple hosts to send their system logs over the network to some other server where they can be analyzed and stored. It’s another one of those weird UDP protocols, and this one is actually stupid simple, even in both of the commonly used forms! Oh, we’ll also cover the one piece of software that I use that handles Syslog — Graylog, which by itself is also really cool.

Syslog

Syslog was first documented in RFC 3164, but was standardized in RFC 5424. While some systems, like HAProxy, default to using the 3164 format unless specified, the 5424 format is the one that’s the most widely used at this point. Both are textual formats, with a single log message per “line” in the protocol. Both are designed to be small, fast, machine- and human-readable, and simple to encode and decode.

RFC 3164 Format

An example 3164 format Syslog message looks like this:

<34>Oct 11 22:14:15 mymachine su[515]: 'su root' failed for lonvick on /dev/pts/8

This message contains one weird value, <34>, but the rest should make sense:

  • It was generated on the 11th of October, at 10:14:15 PM (The year is omitted here, and would be interpreted to be “current year” when the message is processed)
  • The log message was generated on a machine with hostname mymachine
  • The process the log came from was su with process ID 515
  • The log message itself is 'su root' failed for lonvick on /dev/pts/8

The <34> is the “PRI” part, which shows the facility and the severity of the message. Where “facility” is $s$, and “severity” is $l$, the calculation for PRI $p$ is $p = 8s + l$. For that to make sense, each of those two fields is a number.

Facilities

The “facility” is the sub-system that a message belongs to, and there is a standard code:

  1. kern — Kernel messages
  2. user — User-level messages
  3. mail — email system (receiving, sending, or just overall)
  4. daemon — System daemon logs
  5. auth — Security and/or authentication logs
  6. syslog — Internally generated messages by the Syslog handler
  7. lpr — Printer subsystem
  8. news — Network news (NNTP, Usenet) subsystem
  9. uucpUUCP messages
  10. cron — Scheduled job messages
  11. authpriv — “privileged” security / authentication logs
  12. ftpFTP logs
  13. ntpNTP logs
  14. security — Audit logs
  15. console — No clue?
  16. solaris-croncron except on Solaris systems??

Also, 16–23 are local0 through local7, providing basically 8 unmapped and unstandardized facilities.

Severity

The “severity” of a message is somewhat relative to the application that generated it, but they show a message’s general importance. Again, here’s the list:

  1. emerg — Panic condition; system is unusable or otherwise completely down
  2. alert — Immediate action required, e.g., corrupt database file
  3. crit — Critical alert
  4. error — General error
  5. warn — (Generally non-fatal) warning
  6. notice — Normal operating conditions, but of special note
  7. info — Extra, informational messages
  8. debug — Debugging-level messages

I kid you not, there is a mnemonic for this: Every Awesome Cisco Engineer Will Need Icecream Daily.

Anyways, to give an example, “startup complete, process ready” is likely an info message, but “Failed to open socket: address in use” is probably higher, like error or crit depending on how the program in question organizes its messages. As you’ll notice almost all the levels here are general to some point, it’s up to the program to decide what goes where, there’s no rigid definitions like there are for the facilities.


Getting back to the PRI portion of that example message, which was <34>, we know that $\lfloor34 / 8\rfloor = 4$, meaning that the facility is auth, which is what you’d expect for su, and the priority, $34 \mod 8 = 2$, is crit. Therefore, this is a critical message from the authentication subsystem. It’s a little weird, but that’s what they’re using. Yes, this means that <0> is valid, as an emergency-level kernel message, and the maximum is <191>, or a debugging log from local7.

RFC 5464 Format

<34>1 2003-10-11T22:14:15.003Z mymachine.example.com su 515 ID47 - 'su root' failed for lonvick on /dev/pts/8

You’ll notice that the PRI part is the same as before, but now there’s a 1 at the end, which specifies the protocol version: version 1, and only. The timestamp is now a RFC 3339 timestamp, which is a formalization of ISO 8601.1 The log message is now in Unicode, meaning that before 'su root', there should be a BOM for a Syslog server to know how to read the data. The process ID field is now separated by a space, meaning that 515 is still the PID. Also, ID47 is a tag called the MSGID, which is just a string that can be used for filtering messages. It by itself has no further meaning, and can be pretty much anything. But as a general rule, if multiple messages have the same MSGID, they should be somewhat correlated. For example, a firewall that marks dropped incoming packet logs with a MSGID of TCPINDROP. The - seen in the message is a placeholder for “not present”, in this case, the message is missing what’s called “structured data.”

Structured Data (SD)

SD is a set of key=value pairs that can be sent in a 5464 format Syslog message, some keys of which are standardized. As an example, here’s a message with one SD element:

<165>1 2003-10-11T22:14:15.003Z mymachine.example.com eventslog - ID47 [[email protected] iut="3" eventSource="Application" eventID="1011"] An application event log entry...

Note how, in this one, there is no PID (a - after the app name), and the what would have been second - is now the structured data, which in this case, is a singular element. This element’s name, or (SD-)ID, is [email protected], which we’ll get to in a second, and it contains three parameters:

  • iut which is 3
  • eventSource which is Application
  • eventID which is 1011

If there were multiple elements, you’d see multiple [id param="value"] parts, like this:

[[email protected] iut="3" eventSource="Application" eventID="1011"][[email protected] class="high"]

There are some standardized SD-IDs, like timeQuality or origin, but anyone is free to make their own, with one restriction. Custom SD-IDs have a particular format to them: [email protected], where the name is whatever you’ve named it, but the pen is your Private Enterprise Number, assigned by IANA. This makes sure that any SD-IDs that an enterprise uses in their Syslogs won’t conflict with any others, or with any that later become reserved as future parts of the specification.


So the 5464 format is still plenty easy to understand with a tiny bit of explanation, but all the critical parts really need no explanation, maybe except for the facility and severity. Note that most systems that I’ve used don’t use structured data, meaning, while it may be used, I’ve not found something that does use it, at this point.

Also as said before, Syslog usually runs over UDP, since the messages are small (literally one log line per message), and it’s not that critical that you have reliable delivery, but if you want it, Syslog-over-TCP is widely supported, whereby a long-running TCP connection is created, and Syslog messages are passed along periodically as they are generated. TLS can be used if you want to be extra secure, and by itself, the protocol has no security or hardening against replay attacks, if… you want to replay log messages.

Most Linux systems today have some internal Syslog driver, for example, all my containers come with rsyslog, which is a local syslog server, that takes in Syslog messages that systemd produces for everything, and writes it out to the appropriate /var/log files, although a few configuration directives can make it send all Syslog messages over to a remote server as well, using either format, using either UDP or TCP. And in my network, I use this a lot, since all hosts send their messages to…

Graylog

Graylog is a central log management program, which is built on top of ElasticSearch, which is a rather fast JSON document indexing engine. Graylog is capable of handling more than just Syslog, but that’s what I mainly have it doing, ingesting all the Syslog events from my network, and then indexing them neatly. Graylog can be run clustered, with multiple machines all taking part in message indexing and message storage.

Graylog is pretty simple, conceptually: You have a set of “inputs”, like Syslog, FileBeats , GELF, or NetFlow, a set of processing steps, usually called “extractors” on the inputs, which are then sent to a “stream”, and can optionally, from there, be sent to an “output”.

Each stream can have a set of rules that messages must pass to be in part of said stream, and each stream is backed by an ElasticSearch index, though multiple streams can share the same index, no single stream can be backed by multiple indices. Indexes can have their own configuration for index file rotation, message retention, shards, copies, redundancy, all that fun stuff.

It can get complex, but it doesn’t have to be. For example, I have one Syslog input per container (meaning, one designated port per container), and each input has a series of “extractors” that can take elements out and separate them into their own separate message fields. For example, HAProxy has a ton of extractors on it, each RegEx’d to either activate on the HTTP or TCP log lines, which pull out things like timing states, disconnect states, and all sorts of things, into distinct fields, meaning I can then execute queries against them. You can also configure alerting, for example, if X amount of logs matching Y search were found in Z period of time, or instead, make nice dashboards and graphs of what it’s taking in.

Graylog is a Java application though, meaning you have that to deal with. Although, it’s perfectly capable of handling multiple thousands of Syslog messages, per second, and it barely touched the disk journal (basically its queue of unprocessed events) when it did that… all keeping the JVM heap at a maximum size of 1 GiB, no less.

Also of note, so-called “sidecars” exist, which is a little program that can be installed on any host, which will read its configuration from Graylog, and configure additional inputs. For example, I can configure a sidecar to start FileBeats with a pre-determined set of files to look at, and point it at an appropriate FileBeats input on the server, and now, any log files that don’t go through Syslog can still be sent, line by line, in real time, for processing, just like Syslog. Cool!


Really, the Syslog protocol, both formats, are nice if you want to have one system be able to send its current log output to another system, live, either for analysis or storage, and Graylog is a free (for non-enterprise) program capable of doing just that, and even capable of supporting an entire network’s log output in just 1 GiB of JVM heap with no extra help, constantly.


  1. Here, the message has a fractional seconds part, 3 milliseconds past the second, and a timezone of Z, for “Zulu”, which is short for UTC. Even the military refers to UTC / GMT as “Zulu time.” ↩︎

comments powered by Disqus