This repo provides:

- `ai-detector`: 3-expert ensemble classification with MELD, TMR, and MAGE (ModernBERT from `raid_model`).
- `ai-detector-heuristic`: fast local rule-based scoring for triage before running the heavier model ensemble.
- `ai-detector-calibrate`: grid-search operating-point calibration from stored evaluation scores.
- `personal-style-pl`: Polish personal writing-style similarity plus an interpretable, OOD-aware AI-leaning overlay (`ai-markers`).

Full env (detector + style + the papuGaPT2 perplexity model): `./scripts/setup_style_env.sh`.

Detector only: `python -m pip install -e .`

# A. Fast AI-text triage (no model weights). --rich adds the interpretable
# rich-metric + AI-leaning block (needs the style extras).
ai-detector-heuristic --text-file examples/candidates/draft_b.txt --json --rich
# B. "Is this AI, or my own human style?" — score a candidate against a KNOWN-HUMAN
# baseline you build once from reference samples (a folder of .txt files):
personal-style-pl build-profile --samples-dir examples/my_style_samples \
--output artifacts/human.joblib --no-stylometrix
personal-style-pl ai-markers --text-file examples/candidates/draft_b.txt \
--profile artifacts/human.joblib --json
# (append --with-perplexity to BOTH commands for the papuGaPT2 perplexity signal)

Read `ai_leaning_score` (0–100, vs your baseline), each marker's `leaning` + `counted`, and
(with perplexity) `perplexity_flag`. This is advisory triage, not proof of authorship; it abstains on
Polish (`ood_or_unreliable: true`, low confidence). Details: detector in §1–11, style +
AI-vs-human in §12 ("Judging AI-generated vs human text").

python3 -m venv .venv
. .venv/bin/activate
python -m pip install -e .

Then use the installed entrypoints:
ai-detector --help
ai-detector-heuristic --help
ai-detector-calibrate --help
ai-detector-deploy --help

On externally managed Python installs, keep the editable install inside a virtual
environment. For an existing environment where dependencies are already present,
use `.venv/bin/python -m pip install -e . --no-deps`.

- Current contract: the `meld`, `tmr`, and `raid` experts are combined as a weighted ensemble from active experts only. `--weights` accepts `meld,tmr` (mapped to `meld,tmr,0`) or `meld,tmr,raid`.
- In local practice, `tmr`-only runs are the reference for Polish/OOD quick checks because sample outputs are stored in `data/evaluation/*`.
- The default runtime profile is `pl-technical-ood`, a packaged hybrid operating point for this PL/OOD use case. It uses model weights `meld,tmr,raid = 0.75,0.00,0.25`, blends in the fast heuristic with `heuristic_weight = 0.60`, and sets `threshold = 0.434552`.
- Use `--profile legacy-default` to run the old raw default weights `0.34,0.33,0.33`.
- TMR and RAID catch synthetic AI controls, but both over-call many Polish pre-2020 technical papers at the default `0.5` threshold; do not use TMR-only as a reliability proxy for Polish OOD text.
- The heuristic CLI is separate and model-free. Use it for cheap triage; use `ai-detector` for detailed model-based analysis.

Runtime model weights are large Hugging Face artifacts, so they are not vendored
inside the Python wheel. The package declares the code dependencies in
pyproject.toml; ai-detector-deploy materializes the model data dependency
into the local runtime folders.
Fresh checkout deployment:
ai-detector-deploy --all

This downloads the packaged model set:

| expert | Hugging Face model | local folder |
|---|---|---|
| MELD | anon-review-meld-2026/meld | ./meld_model |
| TMR | Oxidane/tmr-ai-text-detector | ./tmr_model |
| MAGE/RAID | GeorgeDrayson/modernbert-ai-detection-raid-mage | ./raid_model |

List the packaged sources without downloading:
ai-detector-deploy --list-models

For reproducible deployment, pin immutable revisions per model:
ai-detector-deploy \
--all \
--meld-revision <immutable_meld_commit_or_tag> \
--tmr-revision <immutable_tmr_commit_or_tag> \
--raid-revision <immutable_raid_commit_or_tag>

You can still pull a single custom Hugging Face model into a local folder:
ai-detector-deploy \
--model-id anon-review-meld-2026/meld \
--revision <immutable_meld_commit_or_tag> \
--target-dir ./meld_model
ai-detector-deploy \
--model-id Oxidane/tmr-ai-text-detector \
--revision <immutable_tmr_commit_or_tag> \
--target-dir ./tmr_model
ai-detector-deploy \
--model-id GeorgeDrayson/modernbert-ai-detection-raid-mage \
--revision <immutable_raid_commit_or_tag> \
--target-dir ./raid_model

Defaults and behavior:

- The `ai-detector-deploy` default target is `./meld_model` (override with `--target-dir`).
- `--revision` defaults to `main`; with `--all`, it is the shared fallback revision unless `--meld-revision`, `--tmr-revision`, or `--raid-revision` is provided. Treat `main` as a mutable pointer. For reproducible deployments, use an immutable revision (commit SHA or tag) from the model page under Commits/Versions.
- Re-running the same command is idempotent; unchanged files are skipped.
- A manifest is written as `ai_detector_model_manifest.json` in each target folder. It stores deployment metadata including `model_id`, the requested `revision`, `fetched_at`, and other model metadata returned by Hugging Face. If you pass `main`, the manifest records `main`; use an immutable commit SHA or tag when the manifest must identify a reproducible model version directly.

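
As a rough illustration of the manifest note above, the following sketch reads a target folder's manifest and classifies the recorded revision. It relies only on the `model_id` and `revision` fields named above; the helper names (`classify_revision`, `check_manifest`) and the three category strings are hypothetical and not part of this package. A tag cannot be recognized by shape alone, so anything that is neither `main` nor a full 40-character commit SHA is reported as `unverified`.

```python
import json
import re
from pathlib import Path

# Full git commit SHA: 40 lowercase hex characters.
_SHA_RE = re.compile(r"[0-9a-f]{40}")


def classify_revision(revision: str) -> str:
    """Hypothetical helper: classify a recorded revision string."""
    if revision == "main":
        return "mutable"          # branch pointer, can move
    if _SHA_RE.fullmatch(revision):
        return "commit_sha"       # pinned to one commit
    return "unverified"           # could be a tag or another branch


def check_manifest(target_dir: str) -> tuple[str, str]:
    """Return (model_id, revision class) from the manifest in target_dir."""
    path = Path(target_dir) / "ai_detector_model_manifest.json"
    manifest = json.loads(path.read_text(encoding="utf-8"))
    return manifest["model_id"], classify_revision(str(manifest["revision"]))
```

For example, `check_manifest("./meld_model")` could be run in a deployment check to flag folders whose manifest still records `main`.
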
Expected local directories for inference:

- `./meld_model`
- `./tmr_model`
- `./raid_model`

Checkout fallback: use python3 deploy_meld.py ... instead of
ai-detector-deploy ... when the package has not been installed yet.
MELD note: ./meld_model contains the MELD head and tokenizer. Its
meld_config.json points the backbone to jhu-clsp/ettin-encoder-400m, so
--local-files-only also requires that backbone to be present in the local
Hugging Face cache. Run once online without --local-files-only, or pre-cache
that backbone, before expecting fully offline MELD inference.
# default profile: pl-technical-ood
ai-detector --text "The sky is blue and clear."
# input from stdin (pipe)
printf "The sky is blue and clear." | ai-detector --json
# input from file
ai-detector --text-file ./input.txt --json
# explicit weights and chunk controls
ai-detector \
--weights 0.34,0.33,0.33 \
--max-chunks 4 \
--batch-size 4 \
--overlap 64 \
--text "The sky is blue and clear."
# zero-weight expert skip
ai-detector --weights 1.0,0.0,0.0 --text "The sky is blue and clear." --json
# explicit Polish/OOD profile
ai-detector \
--profile pl-technical-ood \
--text-file ./input.txt \
--json
# legacy raw default weights
ai-detector --profile legacy-default --weights 0.34,0.33,0.33 --text-file ./input.txt --json

Notes:

- `--weights` accepts two or three values: `meld,tmr,(raid)`.
- `--batch-size` controls chunk batch size for scoring.
- `--max-chunks` caps the number of chunks scored per expert.
- `--threshold` is compared only against the final ensemble value.
- `--profile` selects a packaged operating profile; the default is `pl-technical-ood`.
- `--calibration-file` supplies default weights, threshold, and optional heuristic blend weight from an external JSON operating-point calibration unless the corresponding CLI options are explicitly passed.
- `--heuristic-weight` blends the heuristic score into the final ensemble; the PL/OOD profile sets this automatically.
- `--text`, `--text-file`, stdin pipe, and `--json` output mode are supported.
- `--quiet` suppresses third-party stderr chatter during model loading/scoring; user-facing `Error: ...` messages are still printed on failure.

The heuristic path is a local Python port of the reviewed browser-side detector style: sentence variation, entropy, n-gram repetition, punctuation/list markers, Polish/English AI phrase markers, and AI-style word stems. It does not load model weights and is intended as a red-flag signal only.
# direct text
ai-detector-heuristic --text "Warto zauważyć, że to krótki test operatora z pełnym zdaniem."
# stdin pipe
printf "Warto zauważyć, że w dzisiejszym świecie AI odgrywa kluczową rolę.\n" \
| ai-detector-heuristic --json
# file input
ai-detector-heuristic --text-file ./input.txt --json
# optional: attach the interpretable rich-metric block + AI-leaning overlay
# (lazy-imports personal_style_pl; the base path keeps zero extra dependencies)
ai-detector-heuristic --text-file ./input.txt --json --rich

With `--rich`, the payload gains `experts.heuristic.rich_metrics` (MATTR/MTLD,
burstiness, length-normalized phrase densities) and `experts.heuristic.ai_leaning`
(the AI-direction overlay; abstains on Polish). See §12 for the metrics and the
interpretation/calibration. Without `--rich` the lean detector is unchanged.

Heuristic JSON uses the same top-level operator sections:

- `experts.heuristic.ai_probability`, `human_probability`, `site_metrics`, `categories`, `signals`, `metrics`
- `ensemble.ai_probability`, `ensemble.human_probability`, `ensemble.threshold`, `ensemble.label`
- `calibration.status = uncalibrated_heuristic`

Operator fields:

- `experts.heuristic.site_metrics.*.ai_probability_percent` mirrors the site UI category bars: `Zmienność tekstu` (text variability), `Słownictwo` (vocabulary), `Entropia` (entropy), `Powtarzalność` (repetitiveness), `Sygnatury AI` (AI signatures), `Struktura` (structure).
- `experts.heuristic.categories.*.score` is the internal human-like score used to compute those bars; the site-style AI percentage is `100 - score`.
- `experts.heuristic.signals` lists detected AI phrases, AI-style words, and em/en dash count for quick review.

Parity smoke against the reviewed web-detector behavior:

- Polish marker-heavy sample: `ai_probability = 0.99`
- plain Polish narrative sample: `ai_probability = 0.29`
- English marker-heavy sample: `ai_probability = 0.99`
- very short text is rejected before scoring

These are raw uncalibrated rule scores, not proof of authorship.
The initial six-sample smoke showed moderate heuristic-vs-ensemble agreement
(Pearson=0.690, Spearman=0.771), but the broader pre-2020 Polish technical
paper set showed weak or negative correlation with model scores. Treat the
heuristic as a separate triage signal, not a model-score substitute.
JSON fields used by operators:

- `weights`
- `experts.meld`, `experts.tmr`, `experts.raid`, each with:
  - `ai_score`, `human_score`, `ai_probability`, `human_probability` when loaded
  - `loaded`, `chunks`, optional `notes`
- `ensemble.ai_probability`, `ensemble.human_probability`, `ensemble.threshold`, `ensemble.label`
- `calibration.status`, `calibration.calibrated`, `calibration.message`

Decision logic:

- `ensemble.model_ai_probability` is the weighted average of loaded model experts.
- `ensemble.ai_probability` is `ensemble.model_ai_probability` unless a non-zero `heuristic_weight` is active; with heuristic blending it is `(1 - heuristic_weight) * model_ai_probability + heuristic_weight * heuristic_ai_probability`.
- `ensemble.label` is `ai` if `ensemble.ai_probability >= threshold`, else `human`.
- `ensemble.human_probability` is `1 - ensemble.ai_probability`.
- `weights` reports effective final weights. With the default PL/OOD profile this is `meld=0.30`, `tmr=0.00`, `raid=0.10`, `heuristic=0.60`.

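
The decision logic above can be sketched in a few lines of Python. This is an illustrative re-statement of the documented formulas, not the package's actual code: the function name `blend_ensemble` and its argument layout are assumptions, and the expert scores in the usage lines are made-up values. It also shows why the default profile's model weights `0.75,0.00,0.25` with `heuristic_weight = 0.60` are reported as effective weights `0.30`, `0.00`, `0.10`, `0.60`.

```python
def blend_ensemble(model_weights, model_scores, heuristic_weight=0.0,
                   heuristic_score=None, threshold=0.5):
    """Hypothetical sketch of the documented ensemble decision logic."""
    # Only experts with positive weight and an available score take part.
    active = {n: w for n, w in model_weights.items()
              if w > 0 and model_scores.get(n) is not None}
    total = sum(active.values())
    if total <= 0:
        raise ValueError("no active model experts")

    # Weighted average of the loaded model experts.
    model_ai = sum(w * model_scores[n] for n, w in active.items()) / total

    if heuristic_weight > 0:
        if heuristic_score is None:
            raise ValueError("heuristic_weight > 0 requires a heuristic score")
        ai = (1 - heuristic_weight) * model_ai + heuristic_weight * heuristic_score
    else:
        ai = model_ai

    # Effective final weights: model weights scaled by (1 - heuristic_weight).
    effective = {n: (1 - heuristic_weight) * active.get(n, 0.0) / total
                 for n in model_weights}
    effective["heuristic"] = heuristic_weight

    return {
        "model_ai_probability": model_ai,
        "ai_probability": ai,
        "human_probability": 1 - ai,
        "label": "ai" if ai >= threshold else "human",
        "weights": effective,
    }


# Made-up expert scores, default PL/OOD profile parameters from the text above.
result = blend_ensemble(
    {"meld": 0.75, "tmr": 0.0, "raid": 0.25},
    {"meld": 0.2, "tmr": 0.9, "raid": 0.6},
    heuristic_weight=0.60, heuristic_score=0.5, threshold=0.434552,
)
print(result["label"], result["weights"])
```
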
Scoring details:

- For loaded experts, `human_probability = 1 - ai_probability`.
- For skipped experts (`loaded: false`, typically due to zero weight), all score/probability fields are `null` and `notes` explains the skip.
- Without calibration configuration, all probabilities are raw uncalibrated scores.
- With `--profile pl-technical-ood` or `--calibration-file`, `calibration.status` becomes `operating_point_calibrated`. This means weights/threshold were selected from local validation data; individual expert probabilities are still raw scores, not probability-calibrated estimates.

Privacy note:

- `text_preview` in JSON output is the first 250 characters of raw input text.
- This preview is emitted as part of CLI output and can be copied into logs by operators; avoid sending sensitive or regulated text through the same output channel unless log redaction is in place.
- Keep `text_preview` as a debug aid only; it is not a sanitized/safe-to-keep artifact when input confidentiality is required.

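
One way to apply the privacy note above is to blank the preview before a result is written to any log. The sketch below is a hypothetical operator-side helper (`redact_preview` is not provided by this package); it only assumes the top-level `text_preview` field described above.

```python
import copy


def redact_preview(result: dict, placeholder: str = "[redacted]") -> dict:
    """Return a copy of a detector result that is safer to log.

    Hypothetical helper: replaces the top-level `text_preview` value and
    leaves the caller's original dict untouched.
    """
    redacted = copy.deepcopy(result)
    if "text_preview" in redacted:
        redacted["text_preview"] = placeholder
    return redacted
```

Applied to parsed `--json` output, this keeps `experts`, `ensemble`, and `calibration` intact while dropping the raw text excerpt.
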
See data/evaluation/README.md for the local
sample inventory, source notes, commands, and interpretation.
The Polish snapshots below are tmr-only outputs (weights: 0,1,0):

- `tmr_ai_sample_result.json`
  - `ai_probability: 0.9631`
  - `ensemble.label: ai`
- `tmr_human_author_result.json`
  - `ai_probability: 0.0740`
  - `ensemble.label: human`
- `tmr_human_articles_result.json`
  - `ai_probability: 0.6433`
  - `ensemble.label: ai`

Interpretation:
- these are local checks, not production benchmarks;
- they show a calibration/threshold gap on OOD-like chemistry text (`human_articles` is a false-positive AI label);
- this is intentionally a negative control and should not be interpreted as a successful detection;
- negative or counterintuitive cases should be kept visible for tuning decisions.

Broader PL/OOD smoke:

- Source fixture: `data/evaluation/polish_pre2020_technical_papers/`.
- Baseline result: `data/evaluation/polish_pre2020_technical_papers/broad_eval_2026-05-24.json`.
- Calibration: `data/evaluation/polish_pre2020_technical_papers/calibration_pl_ood_2026-05-24.json`.
- Source cache: original PDFs and clean extracted full text under `data/evaluation/polish_pre2020_technical_papers/sources/`, with SHA-256 checksums in `sources/manifest.json`.
- Corpus: 12 pre-2020 technical/scientific papers; 11 clean Polish entries used for fitting, one mixed English/Polish entry kept for audit only.
- Default ensemble (`0.34,0.33,0.33`) on long 700-word human excerpts: `2/11` false positives at threshold `0.5`, `6/11` at threshold `0.4`.
- Source-group split hybrid profile `0.75,0.00,0.25` plus `heuristic_weight=0.60`, threshold `0.434552`: heldout window smoke `0/12` false positives, `0/3` false negatives; all window records `0/33` false positives, `0/8` false negatives, margin `0.095468`.

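
Counts such as "`2/11` false positives at threshold `0.5`" follow from applying the label rule (`ai` when the score is at or above the threshold) to texts known to be human. The sketch below shows that bookkeeping under stated assumptions: the function names are hypothetical and the scores are made-up values for illustration, not numbers from the evaluation files.

```python
def false_positives(human_scores, threshold):
    """Count known-human texts that would be labelled `ai` (score >= threshold)."""
    return sum(1 for score in human_scores if score >= threshold)


def sweep(human_scores, thresholds):
    """Report 'false positives / total' for each candidate threshold."""
    total = len(human_scores)
    return {t: f"{false_positives(human_scores, t)}/{total}" for t in thresholds}


# Made-up ensemble scores for five known-human texts (illustration only).
example_scores = [0.12, 0.31, 0.42, 0.47, 0.55]
print(sweep(example_scores, [0.4, 0.5]))
```

Lowering the threshold can only keep or increase the false-positive count on a fixed set of human texts, which is why the same corpus shows more false positives at `0.4` than at `0.5`.
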
- PL language support is limited; model quality can drop for Polish text.
- Results on OOD (out-of-distribution) inputs are not guaranteed, and scores can be unstable. Adjust thresholds only after local validation.
- Zero-weight models are not loaded to avoid unnecessary compute.
- The heuristic CLI can false-positive formal Polish prose with common AI-like markers and false-negative edited AI text without those markers.
- Heuristic scores are useful for fast triage, not enforcement or final labeling.
Quick checks:
ai-detector --help
ai-detector-deploy --help
ai-detector-deploy --list-models
ai-detector-heuristic --help
ai-detector-calibrate --help
ai-detector --text "quick smoke test with enough words for heuristic profile metadata" --json
printf "quick smoke test with enough words for heuristic profile metadata\n" > /tmp/ai_detector_input.txt
ai-detector --text-file /tmp/ai_detector_input.txt --json
printf "quick smoke test with enough words for heuristic profile metadata\n" | ai-detector --json
ai-detector-heuristic --text "Warto zauważyć, że to szybki test lokalny z pełnym zdaniem operatora." --json
ai-detector-calibrate \
--baseline-result data/evaluation/polish_pre2020_technical_papers/broad_eval_windows_2026-05-24.json \
--output /tmp/ai_detector_calibration.json \
--id pl_technical_ood_2026_05_24 \
--grid-step 0.05 \
--fixed-weights 0.75,0,0.25 \
--fit-heuristic-weight \
--heuristic-grid-step 0.05 \
--max-heuristic-weight 0.60 \
--split-seed pl-technical-ood-v2

Heavy smoke test:
printf "This is a longer validation snippet for smoke testing the ensemble path.\n" > /tmp/ai_detector_smoke.txt
ai-detector \
--text-file /tmp/ai_detector_smoke.txt \
--weights 0.34,0.33,0.33 \
--max-chunks 4 \
--batch-size 4 \
--overlap 64 \
--quiet \
--json > /tmp/ai_detector_smoke.json
python3 - <<'PY'
import json

# Load the smoke-test output and check the top-level sections operators rely on.
with open("/tmp/ai_detector_smoke.json", "r", encoding="utf-8") as fh:
    result = json.load(fh)
for key in ("experts", "ensemble", "calibration"):
    assert key in result, f"Missing section: {key}"
for expert in ("meld", "tmr", "raid"):
    assert expert in result["experts"], f"Missing expert: {expert}"
print("ok")
PY

Expected smoke test outcome:

- command exits 0
- JSON includes `experts.*`, `ensemble`, `calibration`, `weights`, `device`
- `ensemble.label` is `ai` or `human`

Note: full smoke test requires runtime dependencies (including torch), so it is
recommended to run these commands after activating the project virtualenv. From a
checkout without package entrypoints, replace ai-detector with
python3 run_ensemble.py.

- `.github/workflows/ci.yml` runs on pull requests and pushes to `main`. It checks entrypoint help, syntax, evaluation JSON artifacts, calibration regeneration, `pytest`, package build, and `twine check`.
- `.github/workflows/release.yml` builds source/wheel distributions on manual dispatch and on GitHub Release publish, then runs `twine check`. Release publish also attempts PyPI trusted publishing through the `pypi` environment.
- `.github/workflows/runtime-smoke.yml` is scheduled/manual and downloads cached Hugging Face model artifacts before running a real calibrated inference smoke.

Use the daemon when you send repeated local scoring requests. It keeps models preloaded and scores JSONL requests sequentially in one process, so you avoid CLI cold-start/model-load overhead.
Recommended launch on this CPU (P-core pinning):
OMP_NUM_THREADS=4 MKL_NUM_THREADS=4 taskset -c 0-7 ai-detector-daemon --local-files-only --threads 4 --device cpu --quiet

- `--device cpu` is explicit but optional because the daemon default is CPU.
- Use `--device auto` only when doing direct accelerator-vs-CPU comparisons.
- `taskset -c 0-7` pins the daemon to this host's P-core cluster in the local daemon launch profile.

Protocol: one JSON object per line on stdin, one JSON object per line on stdout.
Choose executable:
if command -v ai-detector-daemon >/dev/null 2>&1; then
DAEMON_CMD=(ai-detector-daemon)
else
DAEMON_CMD=(python3 detector_daemon.py)
fi

Lifecycle (single process):
coproc DAEMON {
OMP_NUM_THREADS=4 MKL_NUM_THREADS=4 taskset -c 0-7 "${DAEMON_CMD[@]}" \
--local-files-only --threads 4 --device cpu --quiet
}
# Health check
printf '%s\n' '{"command":"health"}' >&"${DAEMON[1]}"
read -r health_resp <&"${DAEMON[0]}"
echo "health: $health_resp"
# Scoring request (uses daemon default profile)
printf '%s\n' '{"text":"This is a review sentence with enough context to score.","threshold":0.5}' >&"${DAEMON[1]}"
read -r score_resp <&"${DAEMON[0]}"
echo "score: $score_resp"
# Shutdown/unload path: request process exit and wait for it to terminate
printf '%s\n' '{"command":"shutdown"}' >&"${DAEMON[1]}"
read -r shutdown_resp <&"${DAEMON[0]}"
echo "shutdown: $shutdown_resp"
wait "$DAEMON_PID"

Expected response shapes:

- Health: `{"status":"ok","command":"health","loaded_experts":["meld","tmr","raid"],"device":"cpu","threads":4,"local_files_only":true}`
- Scoring (default PL/OOD profile): `{"text_preview":"...","weights":{"meld":0.30,"tmr":0.00,"raid":0.10,"heuristic":0.60},"experts":{...},"ensemble":{...},"calibration":{...},"device":"cpu"}`
- Shutdown: `{"status":"ok","command":"shutdown","ack":true,"loaded_experts":[]}`
- Sending `{"command":"shutdown"}` is the unload path for this daemon. The daemon is designed to be torn down cleanly this way.
- For reliable RSS recovery, operators should treat process exit/restart as the robust unload path; partial Python GC (`gc.collect()`) may not return all memory to the OS.

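
The line-delimited protocol and the response shapes above can also be driven from Python instead of a bash coprocess. The sketch below is a minimal client under stated assumptions: `start_daemon` and `exchange` are hypothetical helper names, and the command line reuses only the flags shown in the launch example above. It sends one JSON object per line and reads exactly one JSON line back.

```python
import json
import subprocess


def start_daemon(cmd=("ai-detector-daemon", "--local-files-only",
                      "--device", "cpu", "--quiet")):
    """Launch the daemon with line-buffered text pipes (hypothetical helper)."""
    return subprocess.Popen(
        list(cmd),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        bufsize=1,  # line-buffered, matching the one-object-per-line protocol
    )


def exchange(writer, reader, request: dict) -> dict:
    """Send one JSON request line and return the parsed JSON response line."""
    writer.write(json.dumps(request) + "\n")
    writer.flush()
    line = reader.readline()
    if not line:
        raise RuntimeError("daemon closed stdout before replying")
    return json.loads(line)


def run_session(text: str) -> dict:
    """Health check, one scoring request, then shutdown and wait for exit."""
    proc = start_daemon()
    try:
        health = exchange(proc.stdin, proc.stdout, {"command": "health"})
        if health.get("status") != "ok":
            raise RuntimeError(f"unexpected health response: {health}")
        return exchange(proc.stdin, proc.stdout, {"text": text})
    finally:
        try:
            exchange(proc.stdin, proc.stdout, {"command": "shutdown"})
        except (RuntimeError, OSError, ValueError):
            pass  # the daemon may already be gone
        proc.wait()
```

Because requests are scored sequentially in one process, a client like this should wait for each response line before sending the next request.
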
--experts subset behavior:

- Start with `--experts tmr,raid` to preload only those two.
- If a request does not provide `weights`, defaults are remapped only to the preloaded set. For legacy behavior, launch with `--profile legacy-default`; with raw default `0.34,0.33,0.33`, `tmr,raid` becomes `tmr=0.5`, `raid=0.5` internally.
- If a request provides explicit positive weight for a non-preloaded expert, the daemon returns an error.

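
The remapping described above is consistent with renormalizing the default weights over the preloaded experts. The sketch below illustrates that arithmetic and the explicit-weight check; it is an assumption about the behavior, not the daemon's code, and both function names are hypothetical.

```python
def remap_default_weights(defaults: dict, preloaded: set) -> dict:
    """Renormalize default weights over the preloaded experts (sketch)."""
    kept = {name: w for name, w in defaults.items() if name in preloaded}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no preloaded expert has a positive default weight")
    # Experts that were not preloaded end up with weight 0.
    return {name: kept.get(name, 0.0) / total for name in defaults}


def validate_request_weights(weights: dict, preloaded: set) -> None:
    """Reject explicit positive weights for experts that were not preloaded."""
    missing = sorted(n for n, w in weights.items() if w > 0 and n not in preloaded)
    if missing:
        raise ValueError(f"expert(s) not preloaded: {', '.join(missing)}")


legacy = {"meld": 0.34, "tmr": 0.33, "raid": 0.33}
print(remap_default_weights(legacy, {"tmr", "raid"}))
```

With the legacy defaults and `--experts tmr,raid`, this reproduces the `tmr=0.5`, `raid=0.5` split quoted above.
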
Output behavior and privacy:

- JSON output shape matches the CLI contract used by `ai-detector`, including `experts`, `ensemble`, and `calibration`.
- `ai_probability` and `human_probability` are raw uncalibrated values unless a calibrated scoring path is configured.
- `text_preview` is still emitted exactly like the CLI and keeps the first 250 characters; avoid sending sensitive text if this output may be logged.

python3 -m venv .venv
. .venv/bin/activate
python -m pip install torch transformers safetensors

A separate, local tool that learns your Polish writing style from your own
samples and scores how much new text resembles it. It reuses this repo's
Polish tokenizer and AI-marker phrase lists, and can attach the existing
heuristic AI-likeness score via --with-heuristics.
What it does: builds a personal style profile, scores/ranks drafts by similarity to your style, explains which features match/diverge, and suggests conservative, meaning-preserving edits.
What it does NOT do: it is style similarity, not authorship verification and not proof of AI authorship. It never invents facts and never rewrites names, dates, numbers, quotes, URLs, citations, or code.
StyloMetrix pins spacy==3.7.2, which has no wheels above CPython 3.12, so the
whole project standardizes on Python 3.12 with numpy<2. Use the scripted,
reproducible setup (requires uv):
./scripts/setup_style_env.sh

This builds `.venv` on 3.12, installs `.[test,style,style-stylometrix]`, and
installs the Polish pl_nask model --no-deps from the TLS-valid Hugging Face
mirror (the official IPI PAN host has an expired certificate). Surface-only mode
needs just pip install -e ".[style]" and runs with --no-stylometrix.
# Build a profile from a folder of your texts (surface-only / degraded mode)
python -m personal_style_pl.cli build-profile \
--samples-dir examples/my_style_samples \
--output artifacts/profile.joblib --no-stylometrix
# Build with StyloMetrix (172 Polish features; needs the style-stylometrix env)
python -m personal_style_pl.cli build-profile \
--samples-dir examples/my_style_samples --output artifacts/profile.joblib
# Build from CSV (columns: text,source,date,genre)
python -m personal_style_pl.cli build-profile --csv examples/style_samples.csv \
--text-col text --output artifacts/profile.joblib
# Score one text (JSON), optionally attaching heuristic AI-likeness
python -m personal_style_pl.cli score --profile artifacts/profile.joblib \
--text-file examples/candidates/draft_a.txt --json --with-heuristics
# Rank candidate drafts -> CSV
python -m personal_style_pl.cli rank --profile artifacts/profile.joblib \
--candidates-dir examples/candidates --output artifacts/ranking.csv
# Human-readable profile summary
python -m personal_style_pl.cli describe-profile \
--profile artifacts/profile.joblib --output artifacts/profile_summary.md
# Conservative suggestions / deterministic edit
python -m personal_style_pl.cli suggest-edits --profile artifacts/profile.joblib \
--text-file examples/candidates/draft_a.txt --output artifacts/suggestions.md --mode light
python -m personal_style_pl.cli edit --profile artifacts/profile.joblib \
--text-file examples/candidates/draft_a.txt --output artifacts/draft_a.edited.txt --mode light
# Optional supervised mine-vs-other classifier (CSV labels: mine|other)
python -m personal_style_pl.cli train-supervised --csv data/contrast.csv \
--text-col text --label-col label --output artifacts/supervised_style_model.joblib

`style_match_score` is 0–100. Labels: 80–100 `close_to_my_style`,
55–79 `mixed`, 0–54 `far_from_my_style`; `insufficient_text` when the
candidate is too short. `confidence` is `high`/`medium`/`low` based on profile
size and candidate length. The score blends a robust per-feature z-distance
(70%) with cosine-to-centroid (30%), with temperature calibrated from your
training chunks.

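
The label bands above can be read as a simple threshold mapping. The sketch below is illustrative only: `style_label` is a hypothetical helper, and treating the bands as `>= 80` and `>= 55` cut-offs (so a fractional score such as 79.5 falls into `mixed`) is an assumption, since the text lists integer ranges.

```python
def style_label(score: float, insufficient_text: bool = False) -> str:
    """Map a 0-100 style_match_score to its documented label (sketch)."""
    if insufficient_text:
        return "insufficient_text"   # candidate too short to score
    if not 0 <= score <= 100:
        raise ValueError("style_match_score must be within 0-100")
    if score >= 80:
        return "close_to_my_style"
    if score >= 55:
        return "mixed"
    return "far_from_my_style"
```
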
- Provide at least 10–20 samples or 5,000+ words for a meaningful profile; smaller profiles emit a weak-profile warning and force low confidence.
- N-grams are topic-sensitive and off by default (enable with `--include-ngrams`).
- Domain/genre mismatch reduces reliability.
- Use it to understand and match your own style, not to impersonate others or to evade AI detectors.

An interpretable, length-normalized metric layer plus an AI-direction
overlay. It measures length-robust lexical diversity (MATTR / MTLD / HD-D,
borrowed from LexicalRichness),
sentence-length burstiness, repeated higher-order n-grams, and per-1,000-token
densities of em-dashes, boilerplate, transition and hedge phrases. These features
are on by default in every profile (disable with --no-rich). An optional,
gated TextDescriptives block adds
dependency-distance and POS-proportion features over pl_nask (entropy/perplexity
are excluded — they are NaN for Polish).
The ai-markers command reads those metrics against literature-backed AI
directionality (e.g. low burstiness, em-dash and boilerplate over-use, repeated
n-grams) and, when given a profile, scores each marker against your own
baseline:
# Interpretable AI-leaning report vs your style (advisory; abstains on Polish)
python -m personal_style_pl.cli ai-markers \
--text-file examples/candidates/draft_b.txt \
--profile artifacts/profile.joblib --json
# Optional Polish-LM perplexity signal (papuGaPT2; ~500 MB, fetched by the setup
# script). Build the baseline with it, then score the perplexity markers too:
python -m personal_style_pl.cli build-profile --samples-dir examples/my_style_samples \
--output artifacts/profile_ppl.joblib --no-stylometrix --with-perplexity
python -m personal_style_pl.cli ai-markers --text-file examples/candidates/draft_b.txt \
--profile artifacts/profile_ppl.joblib --with-perplexity --json
# The lean detector can attach the same block (lazy-imports; no effect on its deps):
python heuristic_detector.py --text-file examples/candidates/draft_b.txt --json --rich

`suggest-edits` also appends an advisory "AI-leaning markers" section, and
`train-supervised --features rich` trains the mine-vs-other classifier on
surface + rich features.

Stance (important): these markers are an interpretable heuristic, not proof
of authorship. On Polish / out-of-distribution input the overlay abstains
(ood_or_unreliable: true, low confidence) rather than emit a confident AI/human
label — false positives are the dominant harm, and Polish-LM perplexity did not
cleanly separate human from AI-assisted text in our audit, so it is advisory only.
Polish-register calibration (v2.1). A windowed human-vs-AI study on Polish
polymer-chemistry prose (pre-2020 human papers vs generated AI text) found that some
markers are unreliable on this register and are therefore advisory-only on Polish
(shown but not counted in the score): mattr (terse, term-repeating scientific prose
has lower diversity than verbose AI — the polarity flips), em_dash_per_1k (Polish
journals use en-dashes for numeric ranges, e.g. 23–70°C), and repeated_4gram_ratio
(real papers repeat terminology). The markers that separated correctly — and are
counted on Polish — are burstiness_coeff, sentence_len_cv, transition_per_1k,
and boilerplate_per_1k. Median perplexity is surfaced as an advisory
perplexity_flag (calibrated threshold ≈47; modest TPR≈0.65 / FPR≈0.22), never a
verdict. On non-Polish input all markers are counted, since e.g. em-dash overuse is a
genuine English tell. Each marker row carries reliable/counted so you can see what
fed the score. (Caveat: calibrated on a small, single-author/single-AI-model corpus —
treat as a domain heuristic, not a universal classifier; terse "humanized" AI without
boilerplate/transitions can still evade the marker-based score.)
To use this repo to discriminate AI-generated from human text (rather than just style similarity), treat it as an advisory triage, not a verdict:
# 1. Build a KNOWN-HUMAN baseline from reference text you trust is human
# (e.g. the author's pre-LLM / pre-2020 writing). Rich metrics are on by default.
python -m personal_style_pl.cli build-profile --samples-dir known_human_samples/ \
--output artifacts/human_baseline.joblib --no-stylometrix --with-perplexity
# 2. Score the candidate text against that human baseline
python -m personal_style_pl.cli ai-markers --text-file candidate.txt \
--profile artifacts/human_baseline.joblib --with-perplexity --json

How to read the output:

- `ai_leaning_score` (0–100) = % of counted markers leaning AI vs the human baseline. Higher = more AI-leaning. It is relative to the baseline, not an absolute probability.
- Each marker row has `leaning` (`AI-leaning` / `matches` / `more-human-than-you` / `advisory_only`) and `counted` (whether it fed the score). Trust the counted ones.
- `perplexity_flag` (with `--with-perplexity`): `leans_AI_low_perplexity` vs `within_or_above_human_range`; advisory, calibrated on Polish (threshold ≈47).
- `ood_or_unreliable: true` + low `confidence` on Polish → abstain: report as advisory, never as a definitive AI/human label.

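
The definition of `ai_leaning_score` above (the percentage of counted markers that lean AI) can be sketched as follows. The `leaning` and `counted` field names and the `AI-leaning` value come from the description above; the row layout as a list of dicts, the function name, and returning `None` when no marker is counted are assumptions made for illustration.

```python
def ai_leaning_score(marker_rows):
    """Percentage of counted markers whose leaning is 'AI-leaning' (sketch)."""
    counted = [row for row in marker_rows if row.get("counted")]
    if not counted:
        return None  # nothing fed the score; assumed behavior
    leaning_ai = sum(1 for row in counted if row.get("leaning") == "AI-leaning")
    return 100.0 * leaning_ai / len(counted)


# Made-up marker rows for illustration; marker names are taken from the
# Polish-register calibration note, the leanings are invented.
rows = [
    {"marker": "burstiness_coeff", "leaning": "AI-leaning", "counted": True},
    {"marker": "sentence_len_cv", "leaning": "matches", "counted": True},
    {"marker": "transition_per_1k", "leaning": "matches", "counted": True},
    {"marker": "boilerplate_per_1k", "leaning": "more-human-than-you", "counted": True},
    {"marker": "mattr", "leaning": "advisory_only", "counted": False},
]
print(ai_leaning_score(rows))
```

Advisory-only rows are excluded from both the numerator and the denominator, which is why the score depends only on the counted markers.
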
What it can and cannot do (validated on Polish polymer-chemistry prose, v2.1):

- ✅ Reliably separates naive AI (formulaic phrasing, signposted transitions, uniform rhythm, low perplexity) from human: in a held-out study the real human text scored 0 while AI averaged ~29.
- ⚠️ Misses terse "humanized" AI that drops boilerplate/transitions: the marker score can read 0, and only the (weak) perplexity signal hints at it. This is a hard limit of interpretable markers.
- ⚠️ Polish is out-of-distribution: the overlay abstains and only some markers count (see the v2.1 calibration note above). Do not emit confident AI/human verdicts for Polish.
- ⚠️ Use clean prose: PDF/LaTeX-extraction noise inflates perplexity and repetition metrics.

For a stronger, supervised approach when you have labelled examples, see
train-supervised --features rich (a mine-vs-other / human-vs-AI classifier on surface+rich
features). The discrimination methodology used to calibrate v2.1 (windowed human-vs-AI study) is
summarised in the ai_markers.py module docstring.
stylo_metrix and the pl_nask model are GPL-3.0. They are optional extras;
if you redistribute the project as a whole with them, GPL obligations apply.
The rich-metrics extras are permissive: LexicalRichness (MIT) and
TextDescriptives (Apache-2.0); the optional papuGaPT2 model
(dkleczek/papuGaPT2) is MIT.