DataBoar/data-boar

Data Boar

Try it in 30 seconds (no real data): data-boar --demo (or python main.py --demo) → open http://127.0.0.1:8088/pt-br/. Docker: docker run --rm -p 8088:8088 fabioleitao/data_boar:latest demo. Shell wrapper: ./scripts/demo.sh (requires uv). Windows step-by-step: 5-min QuickStart. Synthetic data only; loopback plaintext (--allow-insecure-http).

Data Boar — enterprise data discovery and risk governance: compliance-aware mapping of personal and sensitive data across your data soup (intelligence engine, not a single-jurisdiction “audit app”).

Data Boar mascot

Forensic-grade open-source PII scanner for LGPD · GDPR · evidence-ready compliance.

LGPD — real-world witness report (ISP field visit, Brazil): English · Português (Brasil)

Português (Brasil): README.pt_BR.md · docs/USAGE.pt_BR.md · → Audience guide — who should read which docs · 5 min (pt-BR): QUICKSTART.md


For decision-makers and compliance leads

Your organization needs to know where personal and sensitive data lives—to comply with LGPD, GDPR, CCPA, GLBA, and other major frameworks, and to avoid costly surprises. Data Boar is a multi-dialect, SRE-friendly engine for data discovery and actionable compliance (risk assessment and governance-oriented outputs — Excel, heatmaps, executive Markdown, evidence manifests). It helps you build compliance awareness and surface possible violations without out-of-control cost: one configurable engine that scans your data and reports what it finds, so IT, cybersecurity, compliance, and DPOs can take informed action.

What we surface: Beyond obvious PII (CPF; CNPJ, including the new alphanumeric format; email; phone), we use AI (ML and optional DL) to detect sensitive categories (health, religion, political opinion, biometric, genetic; LGPD Art. 5 II, GDPR Art. 9), field combinations that can re-identify individuals in context (LGPD Art. 5, GDPR Recital 26), and possible minor data (LGPD Art. 14, GDPR Art. 8). We recognise regional document names and flag ambiguous identifiers for manual confirmation, and we reveal exposure across legacy columns, exports, dashboards, and multiple sources in one view, so you see gaps that manual checks or rule-only tools often miss. Findings carry norm_tag and the same risk vocabulary as the Excel output; when jargon piles up (quasi-identifier, categories, cross-border nuance), the Glossary aligns engineering, compliance, and procurement on one lexicon.
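
To make "obvious PII" concrete, here is a minimal sketch of how a CPF candidate can be matched by pattern and then confirmed with the public CPF check-digit rule. It is an illustration only, not Data Boar's detector: the names `CPF_PATTERN`, `cpf_check_digits_ok`, and `find_cpf_candidates` are hypothetical, and the alphanumeric CNPJ format is not covered.

```python
import re

# Formatted (529.982.247-25) or bare 11-digit CPF candidates.
CPF_PATTERN = re.compile(r"\b\d{3}\.?\d{3}\.?\d{3}-?\d{2}\b")


def cpf_check_digits_ok(candidate: str) -> bool:
    """Return True when both CPF check digits match the first nine digits."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if len(digits) != 11 or len(set(digits)) == 1:
        # Wrong length, or a repeated-digit placeholder such as 111.111.111-11.
        return False
    for position in (9, 10):
        # Weights run 10..2 for the first check digit and 11..2 for the second.
        weights = range(position + 1, 1, -1)
        total = sum(d * w for d, w in zip(digits[:position], weights))
        expected = (total * 10) % 11 % 10
        if digits[position] != expected:
            return False
    return True


def find_cpf_candidates(text: str) -> list[str]:
    """Return pattern matches that also pass the check-digit test."""
    return [
        match.group()
        for match in CPF_PATTERN.finditer(text)
        if cpf_check_digits_ok(match.group())
    ]
```

Combining the pattern with the checksum is one way to keep number-shaped noise (order IDs, phone fragments) out of the high-confidence band; values that match the pattern but fail the checksum could still be routed to manual confirmation.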

Children's and minors' data — first-class, not a footnote: Possible minor columns and values get dedicated detector logic, elevated report treatment (including optional deeper DB resampling and cross-reference with identifiers or health-like fields in the same table or path), and compliance-sample YAML vocabulary for US child-privacy contexts—always as inventory and triage signals, never as legal age verification. That is a deliberate linguistic category in the product: the same “boar” that digs through messy enterprise data is wired to surface child-related exposure early for DPO and security review. Technical limits and config keys: MINOR_DETECTION.md (pt-BR); concern-first map: MAP.md (pt-BR).
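
As a rough illustration of a "possible minor" triage signal, the sketch below derives an age from a date-of-birth value and compares it with a configurable threshold. The function names and the default threshold of 18 are assumptions made for this example; the product's actual thresholds and logic are documented in MINOR_DETECTION.md, and a signal like this is an inventory hint, not age verification.

```python
from datetime import date


def age_on(birth: date, today: date) -> int:
    """Completed years between birth and today."""
    years = today.year - birth.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years


def possible_minor(birth: date, today: date, threshold: int = 18) -> bool:
    """Flag a date of birth as a possible-minor signal (hypothetical threshold)."""
    age = age_on(birth, today)
    # Negative ages mean a future date, which is treated as bad data, not a minor.
    return 0 <= age < threshold
```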

The real risk—shadow IT and beyond—often hides in parallel spreadsheets, forgotten folders, legacy databases, lack of standardization, tangled flows, poorly documented applications, exceptions, and excessive data collection. Data Boar keeps sniffing through your data soup to uncover those hidden ingredients—including renamed or cloaked files and weaker transport or storage—so compliance and legal teams see what’s really there, not just how it’s presented. Rich media today: optional metadata scans, image OCR, and subtitle sidecars help surface ingredients that plain full-text search often misses. On the horizon (phased, often opt-in): stronger signals for embedded trackers, steganography, and document-layer tricks (microtext, Unicode cloaking, nested embedded objects)—without promising exhaustive detection until each slice ships.

Hungry for your data soup: Like a boar, we dig into many sources and don't stop at the surface. Whatever the ingredients (files, SQL, NoSQL, APIs, Power BI, Dataverse, SharePoint, SMB/NFS, and more), we're built to ingest and digest them. We never store or exfiltrate PII; we keep only metadata (where a match was found, its pattern type, and its sensitivity), so you get visibility for remediation without moving data.
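
The metadata-only idea can be sketched as a record type that has no field for the matched value at all. Everything here is hypothetical (the `RawMatch` and `Finding` classes, their field names, the `redact` helper, and the example norm tag value); only the notion of storing location, pattern type, sensitivity, and a norm_tag comes from this README.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RawMatch:
    """In-memory detector hit; `value` holds the PII and must not be persisted."""
    target: str
    location: str
    pattern_type: str
    sensitivity: str
    norm_tag: str
    value: str


@dataclass(frozen=True)
class Finding:
    """What gets stored: where, what kind, how sensitive. No matched value."""
    target: str
    location: str
    pattern_type: str
    sensitivity: str
    norm_tag: str


def redact(match: RawMatch) -> Finding:
    """Copy every field except the matched value."""
    return Finding(
        target=match.target,
        location=match.location,
        pattern_type=match.pattern_type,
        sensitivity=match.sensitivity,
        norm_tag=match.norm_tag,
    )
```

Because `Finding` has no `value` field, serialising it with `asdict` cannot leak the matched text by accident.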

Why it holds up: One engine, config-driven (regex, ML/DL terms, norm tags, recommendation overrides)—no code changes to align with different frameworks. Excel reports, heatmaps, and trends across sessions; schedulable scans via API. Baseline patterns cover LGPD, GDPR, CCPA, GLBA, and sample configs extend to UK GDPR, EU GDPR, Benelux, PIPEDA, POPIA, APPI, PCI-DSS, US healthcare vocabulary, and more—each with framework guidance under compliance-samples (navigate from the table below). Regulatory positioning, sector-specific inventory limits (including US health), and optional professional services are summarised in Compliance and legal.

Multilingual and legacy encodings are supported; configurable timeouts and security hardening (validation, headers, audit) are in place. Compressed files: scan inside archives (zip, tar, gz, 7z with optional extra) via config, CLI, or the dashboard—full flags and I/O notes are in USAGE (table below). Optional content-type detection helps find renamed or cloaked files (e.g. a PDF saved with a misleading non-text extension such as .mp3—magic bytes still reveal the real format); early crypto/transport visibility (e.g. TLS vs plaintext) is collected for database and API targets.
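
The renamed-file case above (a PDF saved as .mp3) can be illustrated with a small magic-byte check. This is a sketch under stated assumptions, not the product's detector: the `SIGNATURES` table, `sniff_format`, and `is_cloaked` are hypothetical names, and only four well-known signatures are listed.

```python
from pathlib import Path
from typing import Optional

# (magic bytes, format name, extensions considered consistent with that format)
SIGNATURES = [
    (b"%PDF-", "pdf", {".pdf"}),
    (b"PK\x03\x04", "zip", {".zip", ".docx", ".xlsx", ".pptx"}),
    (b"\x1f\x8b", "gzip", {".gz", ".tgz"}),
    (b"7z\xbc\xaf\x27\x1c", "7z", {".7z"}),
]


def _match_signature(path: Path):
    """Return the first signature entry whose magic bytes start the file."""
    with path.open("rb") as handle:
        header = handle.read(8)
    for entry in SIGNATURES:
        if header.startswith(entry[0]):
            return entry
    return None


def sniff_format(path: Path) -> Optional[str]:
    """Name the real format from the file header, ignoring the extension."""
    entry = _match_signature(path)
    return entry[1] if entry else None


def is_cloaked(path: Path) -> bool:
    """True when the header identifies a format the extension does not claim."""
    entry = _match_signature(path)
    return entry is not None and path.suffix.lower() not in entry[2]
```

Office formats such as .docx are ZIP containers, which is why the ZIP entry accepts several extensions; a real implementation would need a much larger signature table.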

Sniffing with judgment: Plain-text reads use a configurable character budget, so the engine sees real structure rather than a fragment too short to judge before it decides. Entertainment-shaped content routes noisy ML hints into a review band instead of crowding the report with spurious “certain” alerts, while hard pattern hits (IDs, payment data, strong PII) stay decisive for triage.
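
A character budget for plain-text reads can be sketched in a few lines. The function name `read_with_budget`, the default of 20,000 characters, and the error handling are assumptions for illustration; the real configuration key and default are documented in USAGE.

```python
def read_with_budget(path: str, max_chars: int = 20_000, encoding: str = "utf-8") -> str:
    """Read at most `max_chars` characters from a text file.

    In text mode, read(n) counts decoded characters rather than bytes, and
    errors="replace" keeps legacy or mixed encodings from aborting the read.
    """
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        return handle.read(max_chars)
```

A budget like this bounds memory and time per file while still giving detectors enough context to tell a structured export from a short stray fragment.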

Roadmap: We continue to broaden discovery (richer files and media, cloud and enterprise connectors), improve alerts when scans finish, and keep security and dependency practices strong. Operator CLI (in progress): config validation before scans, session diff for regression evidence, and DSAR-oriented export helpers that default to metadata-first outputs—maintainers track slices in the repository’s planning tree (see docs/README.md Internal and reference). Language coverage in the UI and docs, and regional compliance samples, grow incrementally with demand—not every market at once. Already shipped (opt-in): jurisdiction hints—DPO-oriented notes on the Excel Report info sheet from metadata-only signals (not legal conclusions), so multinational teams can prioritise counsel review. Turn them on in config, CLI, dashboard, or API; details: USAGE. For boards and audits: outputs stay briefing-ready—leadership gets clear priorities and concrete evidence on data visibility and compliance posture without treating an engineering roadmap as the main slide deck. Framework lists, audit-oriented norms (e.g. ISO 27701, SOC 2), minors’ privacy sample profiles, and SBOM/API positioning: COMPLIANCE_FRAMEWORKS.md. Shipped vs phased: docs/releases/.

We invite you to get in touch to see how Data Boar can support your compliance journey.

Typical scenarios: Preparing for an audit or regulator request; mapping data before a migration or DLP rollout; raising compliance awareness without a full war room; supporting data privacy consultants as the technical evidence layer in LGPD adequacy engagements.

Current release & changelog: CHANGELOG.md · full notes under docs/releases/ · GitHub Releases. Docker Hub: fabioleitao/data_boar (latest + pinned tags).


The Architect's Vault

Investors, integration partners, and senior technical reviewers often skim the README and then ask: where is the decision trail? This section is the deliberate front door to narrative, positioning, and architecture records that sit beside the code—so the repository reads as a governed product, not “just another script.” Execution backlogs and PMO tables stay one hop away via docs/README.md (Internal and reference); per ADR 0004, this README avoids one-click Markdown links into the plans subtree under docs from this pitch surface.

| If you need… | Start here |
| --- | --- |
| Value proposition (boards, legal, compliance, procurement; concise brief) | DECISION_MAKER_VALUE_BRIEF.md · pt-BR |
| Architecture Decision Records (context, decision, consequences; numbered series) | docs/adr/README.md · pt-BR index |
| Narrative and architecture history (curated product story and stack evolution; placeholder until expanded) | NARRATIVE_AND_ARCHITECTURE_HISTORY.md · pt-BR |
| Governance of the auditor (what the app can prove today about scans, exports, and operator evidence) | ADR 0037 · SRE framing (pt-BR) |
| Concern-first navigation (minors, jurisdiction hints, CISO-style paths) | MAP.md · pt-BR |
| Child / minor data (thresholds, cross-reference, samples; dedicated operator guide) | MINOR_DETECTION.md · pt-BR |
| Full documentation index (all topics; entry to internal reference in one place) | docs/README.md · pt-BR |
| Why evidence-first (public philosophy; no personal history) | THE_WHY.md · pt-BR |

Compliance methodology

Data Boar is positioned as technical inventory and triage: it finds where categories of personal and sensitive data may live, assigns technical severity and norm-oriented hints, and leaves lawful basis, purpose, and retention choices with DPO / counsel. For coursework (e.g. LGPD adequacy indices), we publish a concise verification-module map and a ROPA-style column prioritisation (what to automate first vs human-owned)—so the README is not only a feature list but a bridge to compliance method. Full detail: COMPLIANCE_METHODOLOGY.md · pt-BR.


Technical overview

Data Boar runs as a one-shot CLI audit or as a REST API (default port 8088) with a web dashboard. You configure targets (databases, filesystems, APIs, shares, Power BI, Dataverse) and sensitivity detection (regex + ML, optional DL) in a single YAML or JSON config file. It writes findings and session metadata to a local SQLite database and produces Excel reports and a heatmap PNG per session.
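
To picture the local SQLite store, here is a minimal sketch that writes metadata-only findings for a session and aggregates them by sensitivity, the kind of count a heatmap or report could be built from. The table name, column names, and helper functions are hypothetical and do not reflect Data Boar's actual schema.

```python
import sqlite3

# Hypothetical schema: one row per finding, no matched values stored.
SCHEMA = """
CREATE TABLE IF NOT EXISTS findings (
    session_id   TEXT NOT NULL,
    target       TEXT NOT NULL,
    location     TEXT NOT NULL,
    pattern_type TEXT NOT NULL,
    sensitivity  TEXT NOT NULL
)
"""


def record_findings(conn: sqlite3.Connection, session_id: str, findings) -> None:
    """Insert (target, location, pattern_type, sensitivity) tuples for a session."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO findings VALUES (?, ?, ?, ?, ?)",
        [(session_id, *finding) for finding in findings],
    )
    conn.commit()


def count_by_sensitivity(conn: sqlite3.Connection, session_id: str) -> dict:
    """Return {sensitivity: count} for one scan session."""
    rows = conn.execute(
        "SELECT sensitivity, COUNT(*) FROM findings "
        "WHERE session_id = ? GROUP BY sensitivity",
        (session_id,),
    )
    return dict(rows.fetchall())
```

Keying every row by a session identifier is what would make comparisons across sessions (trends, regression evidence) a simple query.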

| If you need… | See |
| --- | --- |
| Open core vs Pro / Enterprise subscription scope (draft; pricing TBD) | LICENSING_OPEN_CORE_AND_COMMERCIAL.md · pt-BR · LICENSING_SPEC.md (token technicalities, EN) |
| Compliance frameworks, samples, legal summary (DPOs, procurement) | COMPLIANCE_FRAMEWORKS.md · COMPLIANCE_AND_LEGAL.md · compliance-samples/ (frameworks index) |
| Compliance methodology (verification modules, ROPA-style automation priorities) | COMPLIANCE_METHODOLOGY.md · pt-BR |
| Reports and compliance outputs (XLSX, heatmap, audit JSON, maturity export; PDF roadmap) | REPORTS_AND_COMPLIANCE_OUTPUTS.md · pt-BR |
| Product direction and release cadence | docs/releases/ · GitHub Releases |
| Install, run, CLI/API reference, connectors, deploy | Technical guide (EN) · Guia técnico (pt-BR) |
| Configuration schema, credentials, examples | USAGE.md · USAGE.pt_BR.md |
| Deploy (Docker, Compose, Kubernetes) | deploy/DEPLOY.md · deploy/DEPLOY.pt_BR.md |
| Sensitivity detection (ML/DL terms) | SENSITIVITY_DETECTION.md · SENSITIVITY_DETECTION.pt_BR.md |
| Minor / child-related data (thresholds, optional full scan, cross-ref, samples) | MINOR_DETECTION.md · MINOR_DETECTION.pt_BR.md |
| Testing, security, contributing | docs/TESTING.md · SECURITY.md · CONTRIBUTING.md |
| pip from PyPI | pip install data-boar when published; until then git clone + uv sync (see CONTRIBUTING.md, section Repository and install identity) |

Quick start: the 5-min QuickStart walks through both paths (Docker or local uv, copy-paste); the full flag and configuration reference is in USAGE.md. On Linux (native, not Docker), install system libraries before uv sync — see Technical guide — Requirements and environment preparation. Do not commit root config.yaml (.gitignore); it may contain LAN paths and secrets — see CONTRIBUTING.md.

Full documentation index (browse all topics and languages): docs/README.md · docs/README.pt_BR.md.

Glossary (terms and domain language): docs/GLOSSARY.md · docs/GLOSSARY.pt_BR.md.

Legal: TERMS_OF_USE.md (pt-BR) · PRIVACY_POLICY.md (pt-BR).

License and copyright: LICENSE (pt-BR) · NOTICE (pt-BR) · docs/COPYRIGHT_AND_TRADEMARK.md (pt-BR).

Maintainer: Fabio Leitao on GitHub (Docker Hub namespace fabioleitao). Product blog (narrative updates, shorter posts): databoar.wordpress.com; canonical technical documentation stays in this repository (docs/). Other personal professional social links are not embedded in this README; see the GitHub profile (policy: tests/test_pii_guard.py, docs/ops/COMMIT_AND_PR.md).

About

Forensic-grade, open-source PII & sensitive-data discovery — LGPD/GDPR/CCPA-aware. Scans databases, filesystems, APIs & shares; deterministic detection (regex + ML), evidence-ready Excel/heatmap reports. Local-first: your data never leaves. Data soup in, structured findings out.
