Skip to content

Add --split flag for per-object file output#2

Open
rickythefox wants to merge 5 commits into
dbbuilder:masterfrom
rickythefox:feature/split-output
Open

Add --split flag for per-object file output#2
rickythefox wants to merge 5 commits into
dbbuilder:masterfrom
rickythefox:feature/split-output

Conversation

@rickythefox
Copy link
Copy Markdown

@rickythefox rickythefox commented Feb 26, 2026

Summary

  • Adds --split CLI flag that writes each database object to its own file, organized in typed subdirectories (schemas/, tables/, constraints/, indexes/, views/, procedures/, functions/, deferred_fks/, seed_data/)
  • Adds sanitize_filename() helper to make SQL object names filesystem-safe
  • Restructures seed data collection to Dict[str, List[str]] keyed by schema.table to support per-table file output

Test plan

  • --help shows the new --split flag
  • Tested against live dev DB (sql-worko-wdp-silver-dev) — all 266 objects written to correct subdirectories
  • File counts match extraction counts (5 schemas, 70 tables, 54 constraints, 16 indexes, 12 views, 62 procedures, 47 functions)
  • Empty directories (deferred_fks, seed_data) are not created when there's no data
  • CodeRabbit review passed (empty filename edge case fixed)
  • Verify --split is not set by default (modular format still works)

Summary by CodeRabbit

  • New Features

    • CLI --split flag to enable split output mode (one file per object, per-type directories; per-table seed files).
  • Improvements

    • Seed-data extraction no longer produces empty entries and integrates with split output.
    • Safer filename sanitization and clearer extraction/output error types.
  • Tests

    • Added/expanded tests for split routing, split-format output, filename sanitization, collisions/overwrite.
  • Chores

    • Packaging discovery and formatter configuration updated.

setuptools could not find the `src` package without an explicit
`[tool.setuptools.packages.find]` directive, causing a ModuleNotFoundError
when running the installed `sqlextract` console script.
Writes each schema object to its own file organized in typed subdirectories
(schemas/, tables/, constraints/, indexes/, views/, procedures/, functions/,
deferred_fks/, seed_data/). Useful for version control and code review
where individual object diffs are easier to track.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 26, 2026

📝 Walkthrough

Walkthrough

Adds a CLI --split flag that routes extraction to a new split-output path, implements per-object file output in the formatter, threads split through the extractor, adds filename sanitization and new SQLExtractError subclasses, and updates configs and tests accordingly.

Changes

Cohort / File(s) Summary
Project config
pyproject.toml
Enable package discovery for src*; extend Black to include \.pyi?$.
CLI
src/cli.py
Add --split option, add split: bool parameter to main, pass split into extractor.extract(...).
Extraction core
src/extractor.py
Add split: bool = False to DatabaseExtractor.extract; collect seed data into a per-table dict (only when rows present); branch to write_split_format(...) when split=True, otherwise flatten seed data for write_modular_format.
Formatter / Output
src/formatter.py
Import sanitize_filename; add SQLFormatter.write_split_format(...) that writes each object into its own file under type subdirectories (schemas, tables, constraints, indexes, views, procedures, functions, deferred_fks, seed_data). Note: diff contains a duplicate implementation.
Utilities & errors
src/utils.py
Add sanitize_filename(name: str) (replace unsafe chars, trim trailing dots/spaces, reject empty results, escape Windows reserved device names) and new exception classes: SQLExtractError, ConnectionError, ExtractionError, OutputError.
Tests
tests/test_extractor.py, tests/test_formatter.py, tests/test_utils.py
Add tests covering split flag routing, write_split_format behavior (filenames, directories, collisions, seed data), and extensive sanitize_filename cases; extend test mocks and helpers.

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as CLI
    participant Extractor as Extractor
    participant Formatter as Formatter
    participant FS as FileSystem

    User->>CLI: run command with --split
    CLI->>CLI: parse args (split=True)
    CLI->>Extractor: extract(output_dir..., split=True)
    Extractor->>Extractor: collect metadata and per-table seed_data dict
    Extractor->>Formatter: write_split_format(..., seed_data_dict)
    Formatter->>Formatter: iterate object collections
    Formatter->>Formatter: sanitize_filename(name)
    Formatter->>FS: create type subdirectories
    Formatter->>FS: write individual files per object
    FS-->>Formatter: confirm writes
    Formatter-->>Extractor: return status
    Extractor-->>CLI: return completion
    CLI-->>User: print result
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped in with a tiny new flag, --split,

Each table found a door, each schema its kit.
Filenames trimmed, reserved names kept at bay,
Errors named softly to guide the new way.
I cheer for neat folders — a carrot-shaped lit bit.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'Add --split flag for per-object file output' directly and clearly describes the primary change: a new CLI flag that enables writing database objects to separate files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/formatter.py`:
- Around line 460-468: The _write helper currently writes sanitized filenames
(via sanitize_filename) directly and can silently overwrite when different
source names map to the same sanitized path; update _write to detect collisions
by computing the target path from sanitize_filename(filename), then if the path
already exists either (a) compare existing file contents and raise an error if
they differ, or (b) append a deterministic disambiguator (e.g., a short hash of
the original filename or an incremental suffix) to the filename before writing
to avoid overwriting; reference the _write function and sanitize_filename to
implement this check and ensure a clear error or unique filename is produced
instead of silent overwrite.

In `@src/utils.py`:
- Around line 98-109: The sanitize_filename function currently replaces illegal
characters but still returns Windows-reserved basenames (e.g., CON, PRN, AUX,
NUL, COM1..COM9, LPT1..LPT9) which cause failures; update sanitize_filename to
split the filename and extension (use os.path.splitext) check the base name
case-insensitively against the reserved set, and if it matches, modify the base
(for example prefix or suffix an underscore) before recombining so the returned
sanitized name is not a reserved Windows device name; keep the existing
character-replacement and rstrip logic and still raise ValueError if the final
sanitized string is empty.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5e708c0 and 2ae965d.

📒 Files selected for processing (5)
  • pyproject.toml
  • src/cli.py
  • src/extractor.py
  • src/formatter.py
  • src/utils.py

Comment thread src/formatter.py
Comment thread src/utils.py
- Detect filename collisions in _write() when different SQL names sanitize
  to the same path, raising OutputError instead of silently overwriting
- Escape Windows-reserved device names (CON, PRN, AUX, NUL, COM1-9, LPT1-9)
  in sanitize_filename() by appending underscore to the base name
33 new tests covering sanitize_filename (unsafe chars, empty input,
Windows reserved names), write_split_format (directory structure, file
content, collision error), and extractor split flag routing.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/formatter.py (1)

460-475: ⚠️ Potential issue | 🟠 Major

Scope collision detection to the current run and normalize write failures as OutputError.

Line 467 currently treats any pre-existing file as a collision, so rerunning --split into the same directory fails even when there is no in-run name collision. Also, Line 464/473 can still surface raw ValueError/OSError instead of OutputError.

🔧 Proposed fix
         header = self.write_header()
+        written_paths: dict[str, str] = {}

         def _write(subdir: str, filename: str, content: str) -> None:
             """Write a single object file inside a typed subdirectory."""
             dir_path = os.path.join(self.output_dir, subdir)
             os.makedirs(dir_path, exist_ok=True)
-            path = os.path.join(dir_path, sanitize_filename(filename))
-
-            # Detect collisions from different source names sanitizing to the same path
-            if os.path.exists(path):
-                raise OutputError(
-                    f"Filename collision: '{filename}' maps to '{path}' which already exists. "
-                    f"Two different objects sanitized to the same filename."
-                )
-
-            with open(path, 'w', encoding='utf-8') as f:
-                f.write(header)
-                f.write(content)
+            try:
+                safe_filename = sanitize_filename(filename)
+                path = os.path.join(dir_path, safe_filename)
+                collision_key = os.path.normcase(os.path.normpath(path))
+                previous_source = written_paths.get(collision_key)
+                if previous_source is not None and previous_source != filename:
+                    raise OutputError(
+                        f"Filename collision after sanitization: '{previous_source}' vs '{filename}' -> '{safe_filename}'"
+                    )
+                written_paths[collision_key] = filename
+
+                with open(path, 'w', encoding='utf-8') as f:
+                    f.write(header)
+                    f.write(content)
+            except (OSError, ValueError) as exc:
+                raise OutputError(f"Failed to write split output file: {filename}") from exc

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ae965d and 26006ad.

📒 Files selected for processing (2)
  • src/formatter.py
  • src/utils.py

Collision detection now uses an in-memory dict instead of filesystem
checks, so re-running --split into the same directory works. Raw
ValueError/OSError from sanitize_filename and file I/O are wrapped
as OutputError. Added test for re-run idempotency.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 26, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rickythefox
Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 26, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rickythefox
Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 26, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/test_extractor.py (1)

68-68: Use temp-path fixtures instead of hardcoded /tmp/test.

Line 68, Line 118, and Line 167 hardcode a Unix-specific path. Switching to tmp_path/TemporaryDirectory will make these tests portable and less stateful.

♻️ Suggested update
-    def _run_extract(self, split):
+    def _run_extract(self, split, output_dir):
@@
-            extractor.extract(output_dir="/tmp/test", split=split)
+            extractor.extract(output_dir=str(output_dir), split=split)
             return formatter

-    def test_split_true_calls_write_split_format(self):
+    def test_split_true_calls_write_split_format(self, tmp_path):
@@
-        formatter = self._run_extract(split=True)
+        formatter = self._run_extract(split=True, output_dir=tmp_path)

-    def test_split_false_calls_write_modular_format(self):
+    def test_split_false_calls_write_modular_format(self, tmp_path):
@@
-        formatter = self._run_extract(split=False)
+        formatter = self._run_extract(split=False, output_dir=tmp_path)

Also applies to: 118-118, 167-167

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_extractor.py` at line 68, Replace the hardcoded "/tmp/test" in
extractor.extract(...) calls with a temporary directory created by the test (use
the pytest tmp_path fixture or tempfile.TemporaryDirectory), e.g., create a Path
from tmp_path or the TemporaryDirectory's name and pass that as the output_dir
argument to extractor.extract; ensure tests use the temporary path for all three
occurrences and let pytest/tempfile handle cleanup so tests become OS-agnostic
and non-stateful.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/test_extractor.py`:
- Line 68: Replace the hardcoded "/tmp/test" in extractor.extract(...) calls
with a temporary directory created by the test (use the pytest tmp_path fixture
or tempfile.TemporaryDirectory), e.g., create a Path from tmp_path or the
TemporaryDirectory's name and pass that as the output_dir argument to
extractor.extract; ensure tests use the temporary path for all three occurrences
and let pytest/tempfile handle cleanup so tests become OS-agnostic and
non-stateful.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26006ad and fe7e0d7.

📒 Files selected for processing (4)
  • src/formatter.py
  • tests/test_extractor.py
  • tests/test_formatter.py
  • tests/test_utils.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant