Logs parsing config

Typically, as logs are ingested into the Kloudfuse stack, they go through a log pipeline that applies various rules (such as grammar) and automatically extracts log facets and labels/tags from each input log line.

Parsing configuration example

Most of the phases in the logs pipeline are configurable through your custom values.yaml file. Here’s what a sample configuration looks like:

logs-parser:
  kf_parsing_config:
    configPath: "/conf"
    config: |-
      parser_patterns:
      - dissect:
          timestamp_pat: "<REGEX>"
      - grok:
          NGINX_HOST: (?:%{IP:destination_ip}|%{NGINX_NOTSEPARATOR:destination_domain})(:%{NUMBER:destination_port})?
      - remap:
          args:
            kf_source:
            - "$.logSource"
          conditions:
          - matcher: "__kf_agent"
            value: "fluent-bit"
            op: "=="
      - parser:
          dissect:
            args:
            - tokenizer: '%{timestamp} %{level} [LLRealtimeSegmentDataManager_%{segment_name}]'
            conditions:
            - matcher: "%kf_msg"
              value: "LLRealtimeSegmentDataManager_"
              op: "contains"

Note that this is only an example config. Depending on your use case, the required config might be more extensive than what’s documented above.

All parsing config must be specified under this key:

kf_parsing_config:

By default, the config is written as a yaml file in the conf directory. However, the directory name can be overridden by specifying configPath:

kf_parsing_config:
  configPath: "<CUSTOM_CONFIG_DIR>"


All the different functions must be specified in the config section. Here’s the anatomy of the config section:

Parser Patterns

The Kloudfuse stack supports defining grammar either as dissect or grok patterns (see the Grammar section below). If you want to define a pattern once and reuse it across multiple patterns, you can define it here. Consider the sample config from the example above:

We’ve defined two patterns to be used with grammar: one for dissect and one for grok. These definitions can be used in other parts of the config by referring to their keys (timestamp_pat or NGINX_HOST).
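For example, the NGINX_HOST definition above already composes other patterns by name (IP and NUMBER are standard grok patterns). A custom helper such as NGINX_NOTSEPARATOR would be defined alongside it and could then be reused by any pattern that needs it; the regex below is an illustrative assumption, not part of the sample config:

parser_patterns:
- grok:
    NGINX_NOTSEPARATOR: "[^\t ,:]+"     # hypothetical helper definition, referenced by NGINX_HOST below
    NGINX_HOST: (?:%{IP:destination_ip}|%{NGINX_NOTSEPARATOR:destination_domain})(:%{NUMBER:destination_port})?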

Functions

Each function must be specified as a separate item in the list, followed by the function’s required arguments. The next section walks through the logs parsing pipeline and describes each supported function and its required arguments.

Each function can optionally have a list of conditions. These conditions determine whether the function should be executed for a given log line. Every condition is a triplet of a matcher, a value, and an operator (op).

If a function has no condition(s) attached to it, then it’ll always be executed.

matcher can be any one of these:

  • Label - name must be prefixed with #

  • Log Facet - name must be prefixed with @

  • Field - name must be prefixed with %. Currently, the only supported field is kf_msg.

  • JSONPath - a field from the incoming JSON payload, specified as a JSONPath. This is supported only with the remap function.

  • Static string literal __kf_agent - specifies the log agent type. This is supported only with the remap function.

value must be a literal string value or a regex value (depending on the operator).

op is the operator to use for the condition. Currently the following operator types are supported:

  • == (String equality)

  • != (String inequality)

  • =~ (Regex match)

  • !~ (Regex not match)

  • contains

  • startsWith

  • endsWith

  • in (value must be a comma-separated string of values).

All the conditions in the list are combined as a conjunction (&&), and the resulting value determines whether the function is executed.
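For instance, a parser that should run only on a specific service’s log lines could attach two conditions, both of which must hold. This is only a sketch reusing the dissect syntax from the sample config above; the #service label, its value, and the tokenizer are illustrative:

- parser:
    dissect:
      args:
      - tokenizer: '%{timestamp} %{level} %{msg}'
      conditions:
      # Both conditions must evaluate to true (&&) for this parser to run on a log line.
      - matcher: "#service"                       # a label, hence the # prefix
        value: "pinot"
        op: "=="
      - matcher: "%kf_msg"                        # the kf_msg field
        value: "LLRealtimeSegmentDataManager_"
        op: "contains"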

Log Parsing Pipeline and Functions

Here’s what a typical log parsing pipeline looks like in the Kloudfuse stack:

Log lines are ingested into the stack through the ingester service. This service writes the incoming log payloads to a Kafka topic. Each payload is then read by a service called logs-parser and unmarshalled into a JSON string. The unmarshalled JSON string is fed into the pipeline as input, and every log line goes through the pipeline.

Remap

We support logs ingestion from various agents (datadog/fluent-bit/fluentd/kinesis/GCP Pubsub/otlp/filebeat) and in various formats (JSON/msgpack/proto). The remap stage runs as the first step in the pipeline. It is responsible for mapping fields from the incoming payload to fields in an internal representation of the log event (ParseBuilder). If the log payload is in msgpack or proto, it is first converted to JSON format before this stage runs. The following fields are extracted from the log payload:

  • Log message

  • Timestamp

  • Log source

  • Labels/Tags

  • Log Facets

This stage is fully configurable, meaning you can specify how to map fields from the incoming log payload to fields in ParseBuilder. Here’s what a sample remap function looks like:
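The sketch below is not a full argument listing; only kf_source and the conditions syntax are taken from the sample config at the top of this page, while kf_msg as an argument name and the $.log path are assumptions based on the fluent-bit behavior described in the next paragraph:

- remap:
    args:
      kf_msg:                  # assumed argument name for the log message mapping
      - "$.log"                # fluent-bit places the log line in the "log" field
      kf_source:               # log source mapping, as in the sample config
      - "$.logSource"
    conditions:
    - matcher: "__kf_agent"
      value: "fluent-bit"
      op: "=="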

Note that the remap function supports many more arguments than shown above, and most of them ship with a good set of defaults. So, you don’t have to override any arguments unless you’re deviating from the default configuration. For instance, the fluent-bit agent includes the log line in a field called log in the payload. If you have a modify filter that changes this field from log to logLine, then you would need to configure remap as:
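A minimal sketch, again assuming kf_msg is the argument that maps the log message:

- remap:
    args:
      kf_msg:
      - "$.logLine"            # renamed field produced by the modify filter
    conditions:
    - matcher: "__kf_agent"
      value: "fluent-bit"
      op: "=="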

All the fields must be specified in JSONPath notation.

__kf_agent is a special/reserved matcher value, and these are the currently supported agent values:

  • datadog

  • fluent-bit

  • fluentd

  • kinesis

  • gcp

  • otlp

  • filebeat

The other matcher type that is unique to the remap function is the JSONPath notation. All the other functions in the pipeline do *not* support JSONPath notation.
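For example, a remap condition could match directly on a field inside the incoming payload; the $.kubernetes.labels.app path and its value below are hypothetical:

- remap:
    args:
      kf_source:
      - "$.logSource"
    conditions:
    - matcher: "$.kubernetes.labels.app"   # JSONPath into the raw payload; only valid with remap
      value: "nginx"
      op: "=="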

Relabel

This stage in the pipeline operates on labels/tags extracted from the previous stage. In this stage you can:

  • Add/Drop/Replace a label

  • Create a new label by combining various label value(s)

  • Keep/Drop a log event matching label value(s).

The relabel stage follows the same syntax and semantics as Prometheus relabeling. You can refer to its documentation here.

Here’s a sample config for the relabel function:
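A minimal sketch using standard Prometheus relabel_config fields (source_labels, regex, action, target_label, replacement); whether Kloudfuse expects exactly this layout under an args list is an assumption:

- relabel:
    args:
    # Drop debug-level log events.
    - source_labels: ["level"]
      regex: "debug"
      action: "drop"
    # Combine namespace and app into a single workload label.
    - source_labels: ["namespace", "app"]
      separator: "/"
      regex: "(.*)"
      target_label: "workload"
      replacement: "$1"
      action: "replace"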

PreParse

This part of the pipeline is not configurable and is internal to the Kloudfuse stack. One of the main operations in this stage is truncating any log message whose size is greater than 64KB.

Grammar

Kloudfuse currently auto-detects log facets and the timestamp from the input log line. However, this is a heuristic-based approach, and the extracted facets may not always be accurate. Users can define grammars to facilitate log facet extraction. Any user-defined grammar in the config is applied in this stage of the pipeline.

Kloudfuse supports defining grammar in 2 different ways:

  • dissect patterns - a simple text-based tokenizer. See here for more info about dissect patterns. Use this debugger to test out dissect patterns before importing them into the logs-parser config.

  • grok patterns - based on regexes (named regexes). See here or here for more info about grok patterns. Use this debugger to test grok patterns before importing them.

Here’s a sample grammar config (with both dissect and grok patterns):
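As a sketch: the dissect parser below is taken from the sample config at the top of this page, while the grok parser’s pattern key and its condition are assumptions:

- parser:
    dissect:
      args:
      - tokenizer: '%{timestamp} %{level} [LLRealtimeSegmentDataManager_%{segment_name}]'
      conditions:
      - matcher: "%kf_msg"
        value: "LLRealtimeSegmentDataManager_"
        op: "contains"
- parser:
    grok:
      args:
      - pattern: '%{NGINX_HOST} %{GREEDYDATA:request}'   # "pattern" is an assumed key name; reuses the NGINX_HOST definition from parser_patterns
      conditions:
      - matcher: "#source"       # illustrative label
        value: "nginx"
        op: "=="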

KfParse

This part of the pipeline is also not configurable. It is responsible for auto-facet detection and for generating a fingerprint for each log line.

Transform

This is the last function in the pipeline. Its syntax is similar to the relabel function, so it supports everything relabel does, and it can also derive labels from log facets. For instance, if you want to add the value of a log facet called eventSource as a label named source, the config would look like:
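A sketch assuming transform accepts the same relabel_config-style fields as relabel, and that log facets are referenced with the @ prefix described in the conditions section:

- transform:
    args:
    - source_labels: ["@eventSource"]   # log facet, hence the @ prefix (assumed syntax)
      regex: "(.*)"
      target_label: "source"
      replacement: "$1"
      action: "replace"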

Kafka write

At this point, the log line has gone through the entire pipeline, and all facets and labels have been extracted. The ParseBuilder object’s fields are then marshaled into a proto object and written out to Kafka.