Add --split flag for per-object file output by rickythefox · Pull Request #2 · dbbuilder/SQLExtract

rickythefox · 2026-02-26T05:44:10Z

Summary

Adds --split CLI flag that writes each database object to its own file, organized in typed subdirectories (schemas/, tables/, constraints/, indexes/, views/, procedures/, functions/, deferred_fks/, seed_data/)
Adds sanitize_filename() helper to make SQL object names filesystem-safe
Restructures seed data collection to Dict[str, List[str]] keyed by schema.table to support per-table file output

Test plan

--help shows the new --split flag
Tested against live dev DB (sql-worko-wdp-silver-dev) — all 266 objects written to correct subdirectories
File counts match extraction counts (5 schemas, 70 tables, 54 constraints, 16 indexes, 12 views, 62 procedures, 47 functions)
Empty directories (deferred_fks, seed_data) are not created when there's no data
CodeRabbit review passed (empty filename edge case fixed)
Verify --split is not set by default (modular format still works)

Summary by CodeRabbit

New Features
- CLI --split flag to enable split output mode (one file per object, per-type directories; per-table seed files).
Improvements
- Seed-data extraction no longer produces empty entries and integrates with split output.
- Safer filename sanitization and clearer extraction/output error types.
Tests
- Added/expanded tests for split routing, split-format output, filename sanitization, collisions/overwrite.
Chores
- Packaging discovery and formatter configuration updated.

setuptools could not find the `src` package without an explicit `[tool.setuptools.packages.find]` directive, causing a ModuleNotFoundError when running the installed `sqlextract` console script.

Writes each schema object to its own file organized in typed subdirectories (schemas/, tables/, constraints/, indexes/, views/, procedures/, functions/, deferred_fks/, seed_data/). Useful for version control and code review where individual object diffs are easier to track.

coderabbitai · 2026-02-26T05:44:32Z

📝 Walkthrough

Walkthrough

Adds a CLI --split flag that routes extraction to a new split-output path, implements per-object file output in the formatter, threads split through the extractor, adds filename sanitization and new SQLExtractError subclasses, and updates configs and tests accordingly.

Changes

Cohort / File(s)	Summary
Project config `pyproject.toml`	Enable package discovery for `src*`; extend Black to include `\.pyi?$`.
CLI `src/cli.py`	Add `--split` option, add `split: bool` parameter to `main`, pass `split` into `extractor.extract(...)`.
Extraction core `src/extractor.py`	Add `split: bool = False` to `DatabaseExtractor.extract`; collect seed data into a per-table dict (only when rows present); branch to `write_split_format(...)` when `split=True`, otherwise flatten seed data for `write_modular_format`.
Formatter / Output `src/formatter.py`	Import `sanitize_filename`; add `SQLFormatter.write_split_format(...)` that writes each object into its own file under type subdirectories (schemas, tables, constraints, indexes, views, procedures, functions, deferred_fks, seed_data). Note: diff contains a duplicate implementation.
Utilities & errors `src/utils.py`	Add `sanitize_filename(name: str)` (replace unsafe chars, trim trailing dots/spaces, reject empty results, escape Windows reserved device names) and new exception classes: `SQLExtractError`, `ConnectionError`, `ExtractionError`, `OutputError`.
Tests `tests/test_extractor.py`, `tests/test_formatter.py`, `tests/test_utils.py`	Add tests covering split flag routing, write_split_format behavior (filenames, directories, collisions, seed data), and extensive sanitize_filename cases; extend test mocks and helpers.

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as CLI
    participant Extractor as Extractor
    participant Formatter as Formatter
    participant FS as FileSystem

    User->>CLI: run command with --split
    CLI->>CLI: parse args (split=True)
    CLI->>Extractor: extract(output_dir..., split=True)
    Extractor->>Extractor: collect metadata and per-table seed_data dict
    Extractor->>Formatter: write_split_format(..., seed_data_dict)
    Formatter->>Formatter: iterate object collections
    Formatter->>Formatter: sanitize_filename(name)
    Formatter->>FS: create type subdirectories
    Formatter->>FS: write individual files per object
    FS-->>Formatter: confirm writes
    Formatter-->>Extractor: return status
    Extractor-->>CLI: return completion
    CLI-->>User: print result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped in with a tiny new flag, --split,

Each table found a door, each schema its kit.
Filenames trimmed, reserved names kept at bay,
Errors named softly to guide the new way.
I cheer for neat folders — a carrot-shaped lit bit.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'Add --split flag for per-object file output' directly and clearly describes the primary change: a new CLI flag that enables writing database objects to separate files.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/formatter.py`:
- Around line 460-468: The _write helper currently writes sanitized filenames
(via sanitize_filename) directly and can silently overwrite when different
source names map to the same sanitized path; update _write to detect collisions
by computing the target path from sanitize_filename(filename), then if the path
already exists either (a) compare existing file contents and raise an error if
they differ, or (b) append a deterministic disambiguator (e.g., a short hash of
the original filename or an incremental suffix) to the filename before writing
to avoid overwriting; reference the _write function and sanitize_filename to
implement this check and ensure a clear error or unique filename is produced
instead of silent overwrite.

In `@src/utils.py`:
- Around line 98-109: The sanitize_filename function currently replaces illegal
characters but still returns Windows-reserved basenames (e.g., CON, PRN, AUX,
NUL, COM1..COM9, LPT1..LPT9) which cause failures; update sanitize_filename to
split the filename and extension (use os.path.splitext) check the base name
case-insensitively against the reserved set, and if it matches, modify the base
(for example prefix or suffix an underscore) before recombining so the returned
sanitized name is not a reserved Windows device name; keep the existing
character-replacement and rstrip logic and still raise ValueError if the final
sanitized string is empty.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5e708c0 and 2ae965d.

📒 Files selected for processing (5)

pyproject.toml
src/cli.py
src/extractor.py
src/formatter.py
src/utils.py

- Detect filename collisions in _write() when different SQL names sanitize to the same path, raising OutputError instead of silently overwriting - Escape Windows-reserved device names (CON, PRN, AUX, NUL, COM1-9, LPT1-9) in sanitize_filename() by appending underscore to the base name

33 new tests covering sanitize_filename (unsafe chars, empty input, Windows reserved names), write_split_format (directory structure, file content, collision error), and extractor split flag routing.

coderabbitai

♻️ Duplicate comments (1)

src/formatter.py (1)

460-475: ⚠️ Potential issue | 🟠 Major

Scope collision detection to the current run and normalize write failures as OutputError.

Line 467 currently treats any pre-existing file as a collision, so rerunning --split into the same directory fails even when there is no in-run name collision. Also, Line 464/473 can still surface raw ValueError/OSError instead of OutputError.

🔧 Proposed fix

         header = self.write_header()
+        written_paths: dict[str, str] = {}

         def _write(subdir: str, filename: str, content: str) -> None:
             """Write a single object file inside a typed subdirectory."""
             dir_path = os.path.join(self.output_dir, subdir)
             os.makedirs(dir_path, exist_ok=True)
-            path = os.path.join(dir_path, sanitize_filename(filename))
-
-            # Detect collisions from different source names sanitizing to the same path
-            if os.path.exists(path):
-                raise OutputError(
-                    f"Filename collision: '{filename}' maps to '{path}' which already exists. "
-                    f"Two different objects sanitized to the same filename."
-                )
-
-            with open(path, 'w', encoding='utf-8') as f:
-                f.write(header)
-                f.write(content)
+            try:
+                safe_filename = sanitize_filename(filename)
+                path = os.path.join(dir_path, safe_filename)
+                collision_key = os.path.normcase(os.path.normpath(path))
+                previous_source = written_paths.get(collision_key)
+                if previous_source is not None and previous_source != filename:
+                    raise OutputError(
+                        f"Filename collision after sanitization: '{previous_source}' vs '{filename}' -> '{safe_filename}'"
+                    )
+                written_paths[collision_key] = filename
+
+                with open(path, 'w', encoding='utf-8') as f:
+                    f.write(header)
+                    f.write(content)
+            except (OSError, ValueError) as exc:
+                raise OutputError(f"Failed to write split output file: {filename}") from exc

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ae965d and 26006ad.

📒 Files selected for processing (2)

src/formatter.py
src/utils.py

Collision detection now uses an in-memory dict instead of filesystem checks, so re-running --split into the same directory works. Raw ValueError/OSError from sanitize_filename and file I/O are wrapped as OutputError. Added test for re-run idempotency.

coderabbitai · 2026-02-26T06:43:03Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

rickythefox · 2026-02-26T06:43:30Z

@coderabbitai review

coderabbitai · 2026-02-26T06:43:39Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

rickythefox · 2026-02-26T06:52:06Z

@coderabbitai review

coderabbitai · 2026-02-26T06:52:17Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

🧹 Nitpick comments (1)

tests/test_extractor.py (1)

68-68: Use temp-path fixtures instead of hardcoded /tmp/test.

Line 68, Line 118, and Line 167 hardcode a Unix-specific path. Switching to tmp_path/TemporaryDirectory will make these tests portable and less stateful.

♻️ Suggested update

-    def _run_extract(self, split):
+    def _run_extract(self, split, output_dir):
@@
-            extractor.extract(output_dir="/tmp/test", split=split)
+            extractor.extract(output_dir=str(output_dir), split=split)
             return formatter

-    def test_split_true_calls_write_split_format(self):
+    def test_split_true_calls_write_split_format(self, tmp_path):
@@
-        formatter = self._run_extract(split=True)
+        formatter = self._run_extract(split=True, output_dir=tmp_path)

-    def test_split_false_calls_write_modular_format(self):
+    def test_split_false_calls_write_modular_format(self, tmp_path):
@@
-        formatter = self._run_extract(split=False)
+        formatter = self._run_extract(split=False, output_dir=tmp_path)

Also applies to: 118-118, 167-167

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/test_extractor.py` at line 68, Replace the hardcoded "/tmp/test" in
extractor.extract(...) calls with a temporary directory created by the test (use
the pytest tmp_path fixture or tempfile.TemporaryDirectory), e.g., create a Path
from tmp_path or the TemporaryDirectory's name and pass that as the output_dir
argument to extractor.extract; ensure tests use the temporary path for all three
occurrences and let pytest/tempfile handle cleanup so tests become OS-agnostic
and non-stateful.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/test_extractor.py`:
- Line 68: Replace the hardcoded "/tmp/test" in extractor.extract(...) calls
with a temporary directory created by the test (use the pytest tmp_path fixture
or tempfile.TemporaryDirectory), e.g., create a Path from tmp_path or the
TemporaryDirectory's name and pass that as the output_dir argument to
extractor.extract; ensure tests use the temporary path for all three occurrences
and let pytest/tempfile handle cleanup so tests become OS-agnostic and
non-stateful.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26006ad and fe7e0d7.

📒 Files selected for processing (4)

src/formatter.py
tests/test_extractor.py
tests/test_formatter.py
tests/test_utils.py

rickythefox added 2 commits February 26, 2026 12:18

Fix package discovery so the sqlextract entry point resolves correctly

e8be952

setuptools could not find the `src` package without an explicit `[tool.setuptools.packages.find]` directive, causing a ModuleNotFoundError when running the installed `sqlextract` console script.

coderabbitai Bot reviewed Feb 26, 2026

View reviewed changes

Comment thread src/formatter.py

Comment thread src/utils.py

rickythefox added 2 commits February 26, 2026 12:53

Add tests for --split output, sanitize_filename, and collision detection

e75a0a1

33 new tests covering sanitize_filename (unsafe chars, empty input, Windows reserved names), write_split_format (directory structure, file content, collision error), and extractor split flag routing.

coderabbitai Bot reviewed Feb 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add --split flag for per-object file output#2

Add --split flag for per-object file output#2
rickythefox wants to merge 5 commits into
dbbuilder:masterfrom
rickythefox:feature/split-output

rickythefox commented Feb 26, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Feb 26, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot commented Feb 26, 2026

Uh oh!

rickythefox commented Feb 26, 2026

Uh oh!

coderabbitai Bot commented Feb 26, 2026

Uh oh!

rickythefox commented Feb 26, 2026

Uh oh!

coderabbitai Bot commented Feb 26, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rickythefox commented Feb 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented Feb 26, 2026

Uh oh!

rickythefox commented Feb 26, 2026

Uh oh!

coderabbitai Bot commented Feb 26, 2026

Uh oh!

rickythefox commented Feb 26, 2026

Uh oh!

coderabbitai Bot commented Feb 26, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rickythefox commented Feb 26, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Feb 26, 2026 •

edited

Loading