A Query.Farm VGI worker for DuckDB.
vgi-xslt · a Query.Farm VGI worker · powered by SaxonC-HE
A VGI worker that brings XSLT 3.0 / XQuery 3.1 /
XPath 3.1 into DuckDB/SQL. Transform, query, and shred XML documents from
plain SQL — extract values, evaluate XPath, run XQuery FLWORs, and explode one
document into many rows — backed by SaxonC-HE via
the saxonche Python bindings.
INSTALL vgi FROM community; LOAD vgi;
ATTACH 'xslt' (TYPE vgi, LOCATION 'uv run xslt_worker.py');
SELECT xslt.xpath_string('<r><a>x</a></r>', '//a'); -- 'x'
SELECT xslt.xpath_boolean('<r><a/></r>', 'count(//a) = 1'); -- true
SELECT xslt.xpath_number('<r><n>42</n></r>', 'number(//n)'); -- 42.0
SELECT UNNEST(xslt.xpath_array('<r><i>a</i><i>b</i></r>', '//i')); -- 'a','b' ← shred!
SELECT xslt.xquery('<r><i>a</i><i>b</i></r>', 'string-join(//i, ",")'); -- 'a,b'
SELECT xslt.is_well_formed('<a></a>'); -- true
SELECT * FROM xslt.xpath_nodes('<r><i>a</i><i>b</i></r>', '//i'); -- (seq, value) rows
SELECT * FROM xslt.saxon_version();The headline is xpath_array + UNNEST: it turns a column of XML documents
into shredded rows, one per matching node, in a single SQL statement:
SELECT id, UNNEST(xpath_array(doc, '//item')) AS item
FROM documents;-
Scalars take positional arguments only (DuckDB's
name := valuesyntax is a table-function/macro feature, not a scalar one). Every per-row answer is a scalar, so it works inline in any projection or predicate. All scalars here are a fixed shape —(xml)or(xml, expr)— so there are no arity overloads. -
Table functions expand one document into many rows and accept named arguments:
xpath_nodes(xml, expr),xquery_rows(xml, query), and the no-arg discoverysaxon_version().
NULL semantics. A NULL in any scalar argument yields NULL output for that
row; a table function over a NULL document yields no rows. Invalid input —
malformed XML, a bad stylesheet/query, or a bad XPath expression — raises a
clean DuckDB error (never a worker crash), except is_well_formed, which
reports malformed XML as false rather than erroring.
| Function | Form | Signature | Returns |
|---|---|---|---|
xslt |
scalar | (xml, stylesheet) |
VARCHAR (serialized result) |
xpath_string |
scalar | (xml, expr) |
VARCHAR (first match, NULL if none) |
xpath_boolean |
scalar | (xml, expr) |
BOOLEAN (effective boolean value) |
xpath_number |
scalar | (xml, expr) |
DOUBLE (first match, NULL if non-numeric) |
xpath_array |
scalar | (xml, expr) |
VARCHAR[] (string value of all matches) |
xquery |
scalar | (xml, query) |
VARCHAR (serialized result) |
is_well_formed |
scalar | (xml) |
BOOLEAN (false if malformed, never errors) |
xpath_nodes |
table | (xml, expr) |
(seq BIGINT, value VARCHAR) |
xquery_rows |
table | (xml, query) |
(seq BIGINT, value VARCHAR) |
saxon_version |
table | () |
(version VARCHAR) |
xslt(xml, stylesheet) compiles the stylesheet (XSLT 3.0) and applies it to the
document, returning the serialized result. A constant stylesheet applied
down a column compiles once and transforms many rows (compiled executables
are cached by their source text).
SELECT id, xslt(doc, (SELECT stylesheet FROM my_xslt)) AS html
FROM documents;xpath_string / xpath_boolean / xpath_number evaluate an XPath 3.1
expression and return the first match's string value, the expression's effective
boolean value, or the first match's numeric value respectively. xpath_array
returns the string value of every match as a VARCHAR[] — UNNEST it to
shred a document into rows.
SELECT xpath_string(doc, '//title') AS title,
xpath_boolean(doc, 'exists(//published)') AS is_published,
xpath_number(doc, 'count(//comment)') AS num_comments
FROM articles;Namespaces. XPath expressions are evaluated by Saxon over the document's
in-scope namespaces. The most portable way to match namespaced elements without
binding a prefix is the wildcard-namespace test //*:item or a
local-name() = 'item' predicate (both shown in the test suite). Binding a
specific prefix inside a single SQL string is limited by SaxonC's expression
context; prefer wildcard / local-name() where you can't declare a prefix.
xquery(xml, query) runs an XQuery 3.1 query with the document as the context
item and returns the serialized result; xquery_rows(xml, query) explodes the
result sequence into (seq, value) rows.
SELECT * FROM xquery_rows(
'<order><line>1</line><line>2</line><line>3</line></order>',
'for $l in //line return $l * 10'
); -- (1,'10'), (2,'20'), (3,'30')Two complementary tools:
-
xpath_array+UNNEST— shred a column of documents (one expression applied per row). This is the headline:SELECT id, UNNEST(xpath_array(doc, '//item')) AS item FROM documents;
-
xpath_nodes(xml, expr)/xquery_rows(xml, query)— explode a single document into rows with a 1-basedseq:SELECT seq, value FROM xpath_nodes('<r><i>a</i><i>b</i></r>', '//i');
| Component | License | Notes |
|---|---|---|
vgi-xslt (this worker) |
MIT | This repository's own code. |
saxonche (SaxonC-HE 13.0) |
Mozilla Public License 2.0 | The free/open-source Home Edition of Saxon. |
vgi-python |
Query Farm Source-Available | The VGI SDK. |
This worker uses SaxonC-HE — the Home Edition, distributed under the
Mozilla Public License 2.0 (the free/open-source Saxon edition). The
saxonche wheel on PyPI ships SaxonC-HE 13.0. The processor is created with
license=False, so no Saxon licence file is required and no commercial Saxon
key is involved.
No XSD / schema awareness. SaxonC-HE is not schema-aware: there is no XSD validation and no schema-aware XSLT/XQuery. This worker therefore exposes no
xsd_validatefunction —is_well_formedchecks well-formedness (does it parse), which is the strongest XML check HE can offer. Saxon's commercial PE/EE editions add schema awareness; this worker deliberately stays on the free HE edition.
MPL-2.0 is a permissive, file-level copyleft licence: using saxonche as an
unmodified, separately pip-installed dependency keeps vgi-xslt's own code
MIT and fine for commercial use.
uv sync --all-extras # create .venv with vgi-python + saxonche + dev tools
make test # pytest (unit + integration) + SQL end-to-end
make test-unit # pytest only
make test-sql # DuckDB sqllogictest files via haybarn-unittest
uv run ruff check . # lint
uv run mypy vgi_xslt/tests/test_engine.py covers the pure Saxon primitives directly (malformed XML,
empty strings, bad XPath, no-match, namespaces, a real XSLT identity +
value-extract transform, an XQuery FLWOR sequence); tests/test_tables.py
drives the table functions through the real bind→init→process lifecycle
in-process; tests/test_scalars.py spawns xslt_worker.py over the VGI
client/RPC stack exactly as DuckDB would after ATTACH. The test/sql/*.test
files are DuckDB sqllogictest cases run by
haybarn-unittest
(uv tool install haybarn-unittest) against a real ATTACH + SELECT.
xslt_worker.py entry point; assembles the `xslt` catalog (inline uv script metadata)
Makefile test / test-unit / test-sql targets
vgi_xslt/
engine.py pure Saxon primitives: singleton processor + compile caches (only saxonche importer)
scalars.py per-row scalars (xslt, xpath_*, xquery, is_well_formed)
tables.py table functions: xpath_nodes, xquery_rows, saxon_version
schema_utils.py Arrow field/comment helper
tests/
harness.py in-process bind→init→process driver
test_engine.py pure-logic unit + error/edge tests
test_tables.py table-function integration tests
test_scalars.py per-row scalars via vgi.client.Client
test/sql/
*.test DuckDB sqllogictest end-to-end cases (haybarn-unittest)
Written by Query.Farm.
Copyright 2026 Query Farm LLC - https://query.farm
