diff --git a/src/content/docs/reference/operators.mdx b/src/content/docs/reference/operators.mdx
index 15c5042e6..0a76d91d4 100644
--- a/src/content/docs/reference/operators.mdx
+++ b/src/content/docs/reference/operators.mdx
@@ -515,6 +515,10 @@ operators:
     description: 'Parses an incoming bytes stream into a single event.'
     example: 'read_all binary=true'
     path: 'reference/operators/read_all'
+  - name: 'read_auto'
+    description: 'Detects the input format of a byte stream and selects a matching reader.'
+    example: 'read_auto fallback="lines"'
+    path: 'reference/operators/read_auto'
   - name: 'read_bitz'
     description: 'Parses bytes as *BITZ* format.'
     example: 'read_bitz'
@@ -2157,6 +2161,14 @@ read_all binary=true
 
 
 
+
+
+```tql
+read_auto fallback="lines"
+```
+
+
+
 
 
 ```tql
diff --git a/src/content/docs/reference/operators/read_auto.mdx b/src/content/docs/reference/operators/read_auto.mdx
new file mode 100644
index 000000000..b8d3617f1
--- /dev/null
+++ b/src/content/docs/reference/operators/read_auto.mdx
@@ -0,0 +1,116 @@
+---
+title: read_auto
+category: Parsing
+example: 'read_auto fallback="lines"'
+---
+
+Detects the input format of a byte stream and selects a matching reader.
+
+```tql
+read_auto [fallback=string, max_probe_bytes=uint]
+```
+
+## Description
+
+The `read_auto` operator probes the first bytes of its input and starts the
+reader whose detector returns the best unique match. Use it when the input format
+is unknown at authoring time, but should still be one of Tenzir's structured
+formats.
+
+By default, detection is strict. If no detector matches, or if multiple
+detectors match with the same score, `read_auto` emits an error instead of
+falling back to a generic text reader.
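+
+As a minimal sketch of the strict default (the file name `unknown.dat` is a
+hypothetical placeholder), the following pipeline emits an error if no detector
+matches or if several detectors tie. Per the `fallback` option described below,
+omitting `fallback` behaves like `fallback="none"`:
+
+```tql
+// Hypothetical input file; strict detection is the default.
+from_file "unknown.dat" {
+  read_auto
+}
+```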
+
+The built-in detectors cover common JSON, delimited text, security log, and
+magic-byte formats, including NDJSON, JSON objects, JSON arrays of objects, CSV,
+TSV, SSV, key-value text, YAML, Syslog, CEF, LEEF, Zeek TSV, Suricata EVE JSON,
+Zeek JSON, GELF, PCAP, Feather, BITZ, and Parquet.
+
+### `fallback = string (optional)`
+
+Controls what happens when no detector matches.
+
+Valid values are:
+
+- `"none"`: Emit an error. This is the default.
+- `"lines"`: Use `read_lines`. The input must be valid UTF-8.
+- `"all"`: Use `read_all`. `read_auto` uses the current probe to
+  choose between text and binary output: valid UTF-8 probe bytes select
+  `read_all`, while invalid probe bytes select `read_all binary=true`. If
+  binary input can start with a valid UTF-8 prefix longer than
+  `max_probe_bytes`, use a larger probe limit or `read_all` with
+  `binary=true` directly.
+
+### `max_probe_bytes = uint (optional)`
+
+The maximum number of bytes to inspect before forcing a detection decision.
+
+Defaults to `1Mi` bytes.
+
+## Examples
+
+### Detect JSON lines
+
+Given this input:
+
+```json title="events.ndjson"
+{"x":1}
+{"x":2}
+```
+
+Use `read_auto` where you would normally use a concrete reader:
+
+```tql
+from_file "events.ndjson" {
+  read_auto
+}
+```
+
+```tql
+{x: 1}
+{x: 2}
+```
+
+### Fall back to lines
+
+For arbitrary UTF-8 text, opt into line-based parsing explicitly:
+
+```txt title="messages.txt"
+hello
+world
+```
+
+```tql
+from_file "messages.txt" {
+  read_auto fallback="lines"
+}
+```
+
+```tql
+{line: "hello"}
+{line: "world"}
+```
+
+### Fall back to a single event
+
+Use `fallback="all"` when unknown input should become one event instead of one
+event per line:
+
+```tql
+from_file "payload.bin" {
+  read_auto fallback="all"
+}
+```
+
+If the input is binary, the resulting event contains a `blob` value in the
+`data` field.
+
+## See Also
+
+- `read_all`
+- `read_csv`
+- `read_json`
+- `read_lines`
+- `read_ndjson`
+- `read_syslog`
+- `read_yaml`