Logs parsing config
Typically, as logs are ingested into the Kloudfuse stack, they go through a log pipeline that applies various rules (such as grammar) and automatically extracts log facets and labels/tags from each input log line.
Parsing configuration example
Most of the phases in the logs pipeline are configurable, and you can configure them in your custom values.yaml file. Here’s what a sample configuration looks like:
```yaml
logs-parser:
  kf_parsing_config:
    configPath: "/conf"
    config: |-
      parser_patterns:
      - dissect:
          timestamp_pat: "<REGEX>"
      - grok:
          NGINX_HOST: (?:%{IP:destination_ip}|%{NGINX_NOTSEPARATOR:destination_domain})(:%{NUMBER:destination_port})?
      - remap:
          args:
            kf_source:
            - "$.logSource"
          conditions:
          - matcher: "__kf_agent"
            value: "fluent-bit"
            op: "=="
      - parser:
          dissect:
            args:
            - tokenizer: '%{timestamp} %{level} [LLRealtimeSegmentDataManager_%{segment_name}]'
            conditions:
            - matcher: "%kf_msg"
              value: "LLRealtimeSegmentDataManager_"
              op: "contains"
```
Note that this is only an example config; depending on your use case, the config you need may be more extensive than what’s documented above.
All parsing config must be specified under this key:
```yaml
kf_parsing_config:
```
By default, the config is written as a YAML file in the `conf` directory. However, the directory name can be overridden by specifying `configPath`:
```yaml
kf_parsing_config:
  configPath: "<CUSTOM_CONFIG_DIR>"
```
All the different functions must be specified in the `config` section. Here’s the anatomy of the `config` section:
Parser Patterns
The Kloudfuse stack supports defining grammar as either `dissect` or `grok` patterns (see the Grammar section below). If you want to define a pattern once and reuse it across multiple patterns, you can define it here. Consider the sample config from the example above: we’ve defined two patterns to be used with grammar, one for `dissect` and another for `grok`. These definitions can be used in other parts of the config by referring to them by their keys (`timestamp_pat` or `NGINX_HOST`); see the sample grammar config in the Grammar section below.
Functions
Each function must be specified as a separate item in the list, followed by the function’s required arguments. The next section goes through the logs parsing pipeline and describes each supported function and its required arguments.
Each function can optionally have a list of conditions. These conditions determine whether the function should be executed for a given log line. Every condition is a triplet of a `matcher`, a `value`, and an operator (`op`).
If a function has no condition(s) attached to it, then it’ll always be executed.
`matcher` can be any one of these:
- Label: the name must be prefixed with `#`
- Log Facet: the name must be prefixed with `@`
- Field: the name must be prefixed with `%`. Currently, the only supported field is `kf_msg`.
- JSONPath: a field from the incoming JSON payload, specified as a JSONPath expression. This is supported only with the remap function.
- A static string literal
- `__kf_agent`: specifies the log agent type. This is supported only with the remap function.
`value` must be a literal string value or a regex value (depending on the operator).
`op` is the operator to use for the condition. Currently, the following operator types are supported:
- `==` (string equality)
- `!=` (string inequality)
- `=~` (regex match)
- `!~` (regex does not match)
- `contains`
- `startsWith`
- `endsWith`
- `in` (`value` must be a comma-separated string value)
All the conditions in the `conditions` list are treated as a conjunction (`&&`), and the final value determines whether the function is executed for a given log line.
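For example, a function guarded by the following conditions runs only for log lines that carry a `namespace` label matching `prod-.*` and whose message contains `ERROR`; the label name and values here are purely illustrative:
```yaml
conditions:
# Label matcher: label names are prefixed with '#'
- matcher: "#namespace"
  value: "prod-.*"
  op: "=~"
# Field matcher: fields are prefixed with '%'; kf_msg is the only supported field
- matcher: "%kf_msg"
  value: "ERROR"
  op: "contains"
```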
Log Parsing Pipeline and Functions
Here’s what a typical log parsing pipeline looks like in the Kloudfuse stack:
Log lines are ingested into the stack through the ingester service, which writes the incoming log payloads to a Kafka topic. Each payload is then read by a service called `logs-parser` and unmarshalled into a JSON string. The unmarshalled JSON string is fed into the pipeline as input, and every log line goes through the pipeline.
Remap
We support log ingestion from various agents (datadog, fluent-bit, fluentd, kinesis, GCP Pub/Sub, otlp, filebeat) and in various formats (JSON/msgpack/proto). The `remap` stage runs as the first step in the pipeline. It is responsible for mapping fields from the incoming payload to fields in an internal representation of the log event (`ParseBuilder`). If the log payload is in msgpack or proto format, it is first converted to JSON before this stage runs. We extract the following fields from the log payload:
- Log message
- Timestamp
- Log source
- Labels/Tags
- Log Facets
This stage is fully configurable, meaning you can instruct how to map fields from the incoming log payload to fields in `ParseBuilder`. Here’s what a sample `remap` function looks like:
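The snippet below is a minimal sketch based on the `remap` entry from the sample configuration above; it only shows the `kf_source` argument and a `__kf_agent` condition, with all other arguments left at their defaults:
```yaml
- remap:
    args:
      # Map the "logSource" field of the incoming JSON payload to the log source
      kf_source:
      - "$.logSource"
    conditions:
    # Apply this mapping only to payloads sent by the fluent-bit agent
    - matcher: "__kf_agent"
      value: "fluent-bit"
      op: "=="
```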
The `remap` function supports more arguments than what’s shown above; however, most of these fields ship with a good set of defaults, so you don’t have to override any arguments unless you’re deviating from the default configuration. For instance, the fluent-bit agent includes the log line in a field called `log` in the payload. If you have a `modify` filter that changes this field from `log` to `logLine`, then you would need to configure `remap` as:
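A sketch of that override is shown below. The `kf_msg` argument name is an assumption made by analogy with `kf_source` (the sample configuration above only shows `kf_source`), so treat it as a hypothetical key rather than a confirmed one:
```yaml
- remap:
    args:
      # "kf_msg" is a hypothetical argument name (by analogy with kf_source);
      # it maps the renamed "logLine" payload field to the log message.
      kf_msg:
      - "$.logLine"
    conditions:
    - matcher: "__kf_agent"
      value: "fluent-bit"
      op: "=="
```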
All the fields must be specified in JSONPath notation.
`__kf_agent` is a special/reserved matcher value, and these are the currently supported agent values:
- `datadog`
- `fluent-bit`
- `fluentd`
- `kinesis`
- `gcp`
- `otlp`
- `filebeat`
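For example, a condition that matches payloads from either of two agents can combine `__kf_agent` with the `in` operator; this is an illustrative sketch:
```yaml
conditions:
# Matches when the payload comes from fluentd or fluent-bit
- matcher: "__kf_agent"
  value: "fluentd,fluent-bit"
  op: "in"
```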
The other `matcher` template that is unique to the remap function is JSONPath notation. None of the other functions in the pipeline support JSONPath notation.
Relabel
This stage in the pipeline operates on labels/tags extracted from the previous stage. In this stage you can:
- Add/Drop/Replace a label
- Create a new label by combining various label value(s)
- Keep/Drop a log event matching label value(s)
The relabel stage follows the same syntax and semantics as Prometheus relabeling; see the Prometheus `relabel_config` documentation for details.
Here’s a sample config for the `relabel` function:
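The sketch below is illustrative and assumes Prometheus-style relabel keys (`source_labels`, `regex`, `target_label`, `action`), since the stage follows Prometheus relabeling semantics; the label names and the nesting under `args` are assumptions:
```yaml
- relabel:
    args:
    # Copy the value of the "pod" label into a new "workload" label
    - source_labels: ["pod"]
      regex: "(.*)"
      target_label: "workload"
      replacement: "$1"
      action: "replace"
    # Drop log events whose "env" label is "dev"
    - source_labels: ["env"]
      regex: "dev"
      action: "drop"
```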
PreParse
This part of the pipeline is not configurable and is internal to the Kloudfuse stack. One of the main functions run as part of this stage is truncating any log message whose size is greater than 64KB.
Grammar
Kloudfuse currently autodetects log facets and the timestamp from the input log line. However, this is a heuristic-based approach, and the extracted facets may not always be accurate. Users can define grammars to facilitate log facet extraction. Any user-defined grammar in the config is applied at this stage of the pipeline.
Kloudfuse supports defining grammar in two different ways:
- `dissect` patterns: a simple text-based tokenizer. See the dissect documentation for more info, and use a dissect debugger to test out patterns before importing them into the logs-parser config.
- `grok` patterns: based on (named) regexes. See the grok documentation for more info, and use a grok debugger to test patterns before importing them.
Here’s a sample grammar config (with both `dissect` and `grok` patterns):
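A minimal sketch is shown below, reusing the `dissect` tokenizer from the sample configuration at the top of this page and referencing the `NGINX_HOST` grok definition from `parser_patterns`; the grok parser’s `pattern` argument name is an assumption for illustration:
```yaml
- parser:
    dissect:
      args:
      # Extract timestamp, level, and segment name from matching log lines
      - tokenizer: '%{timestamp} %{level} [LLRealtimeSegmentDataManager_%{segment_name}]'
      conditions:
      - matcher: "%kf_msg"
        value: "LLRealtimeSegmentDataManager_"
        op: "contains"
- parser:
    grok:
      args:
      # "pattern" is a hypothetical argument name; %{NGINX_HOST} references the
      # reusable pattern defined under parser_patterns.
      - pattern: '%{NGINX_HOST} %{GREEDYDATA:request}'
```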
KfParse
This part of the pipeline is also not configurable. It is responsible for auto-facet detection and for generating a fingerprint for a given log line.
Transform
This is the last function in the pipeline. Its syntax is similar to the `relabel` function, so you can do everything that’s available in `relabel`, but also derive labels from log facets. For instance, if you want to add the value of a log facet called `eventSource` as a label named `source`, the config would look like:
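A minimal sketch, assuming the Prometheus-style relabel keys described for the relabel stage and the `@` prefix for referencing log facets; the exact argument layout is an assumption:
```yaml
- transform:
    args:
    # Copy the value of the "eventSource" log facet (facets are prefixed with '@')
    # into a label named "source".
    - source_labels: ["@eventSource"]
      regex: "(.*)"
      target_label: "source"
      replacement: "$1"
      action: "replace"
```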
Kafka write
At this point, the log line has gone through the entire pipeline and all the facets and labels have been extracted. The `ParseBuilder` object’s fields are then marshaled into a proto object and written out to Kafka.