[data] write_lance(mode=CREATE) errors instead of silently overwriting#64364
[data] write_lance(mode=CREATE) errors instead of silently overwriting#64364tanmayrauth wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces safety checks in the Lance datasink to prevent overwriting existing datasets when using SaveMode.CREATE and to ensure datasets exist when using SaveMode.APPEND. It also adds corresponding unit tests to verify these behaviors. The review feedback highlights a potential issue in _open_existing_dataset where catching all ValueError exceptions could swallow unrelated configuration errors (such as invalid storage options). It is recommended to inspect the exception message to ensure it specifically indicates a missing dataset before returning None.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| try: | ||
| return lance.LanceDataset( | ||
| self.uri, | ||
| storage_options=self.storage_options, | ||
| storage_options_provider=self.storage_options_provider, | ||
| ) | ||
| except ValueError: | ||
| # Lance raises ValueError when the dataset does not exist. | ||
| return None |
There was a problem hiding this comment.
Catching all ValueError exceptions and unconditionally returning None can swallow unrelated errors, such as invalid storage_options (e.g., invalid keys, types, or values) or other configuration issues. This can lead to highly misleading error messages (e.g., claiming the dataset does not exist when in fact the storage options are invalid).
To make this more robust, we should:
- Also catch
FileNotFoundError(orOSError) in case certain environments or future Lance versions raise it for missing local/remote paths. - Inspect the exception message to ensure it is indeed a "not found" or "does not exist" error before returning
None, and re-raise the exception otherwise.
| try: | |
| return lance.LanceDataset( | |
| self.uri, | |
| storage_options=self.storage_options, | |
| storage_options_provider=self.storage_options_provider, | |
| ) | |
| except ValueError: | |
| # Lance raises ValueError when the dataset does not exist. | |
| return None | |
| try: | |
| return lance.LanceDataset( | |
| self.uri, | |
| storage_options=self.storage_options, | |
| storage_options_provider=self.storage_options_provider, | |
| ) | |
| except (ValueError, FileNotFoundError) as e: | |
| # Only return None if the error indicates the dataset is missing. | |
| # Other ValueErrors (e.g. invalid storage options) should propagate. | |
| err_msg = str(e).lower() | |
| if "not found" in err_msg or "does not exist" in err_msg or "no such file" in err_msg: | |
| return None | |
| raise |
7822872 to
7b5039a
Compare
|
@abrarsheikh can you please review it? |
| msg = str(e).lower() | ||
| if "not found" in msg or "does not exist" in msg or "no such file" in msg: | ||
| return None |
There was a problem hiding this comment.
conditional block based on string matching is too brittle as these error messages may drift with package version, please figure out a more robust method.
There was a problem hiding this comment.
@abrarsheikh Yes, I've dropped the error-string matching entirely. Existence is now determined by whether lance.LanceDataset(...) opens successfully — a version-independent signal that also natively honors storage_options. For CREATE, a successful open means "exists" → raise; otherwise we proceed and let the actual write surface any real error (e.g. bad storage options) with Lance's own message. APPEND just opens directly and lets Lance raise its native not-found error. The tests no longer pin any error message.
SaveMode.CREATE is documented to "create new data and error if data already exists", but the Lance sink mapped both CREATE and OVERWRITE to lance.LanceOperation.Overwrite without any existence check, silently clobbering an existing dataset. This diverges from FileDatasink, which errors on CREATE when the destination already exists. on_write_start now checks for an existing dataset and raises on CREATE, directing users to OVERWRITE/APPEND. APPEND likewise raises a clear error when the dataset is missing. The check is skipped for namespace-backed writes, which declare/create the table location up front. Closes ray-project#64363 Signed-off-by: Tanmay Rauth <t_rauth@apple.com>
23ab01e to
1985caa
Compare
|
@abrarsheikh can you please review it when you have a moment? |
| bad ``storage_options``, etc.). Opening natively honors | ||
| ``storage_options``/``storage_options_provider``. | ||
| """ | ||
| import lance |
There was a problem hiding this comment.
why is this improve inside the function?
There was a problem hiding this comment.
lance is the optional pylance dependency, so it's imported lazily inside functions throughout this module rather than at the top — on_write_complete (line 307) and _write_fragment (line 101) do the same. A top-level import would make import lance_datasink itself fail for anyone without pylance installed, which is exactly why on_write_start guards with _check_import(..., package="pylance") first.
Happy to centralize it into one lazy helper if you'd prefer a single import site, but I kept it inline to match the existing pattern.
SaveMode.CREATE is documented to "create new data and error if data already exists", but the Lance sink mapped both CREATE and OVERWRITE to lance.LanceOperation.Overwrite without any existence check, silently clobbering an existing dataset. This diverges from FileDatasink, which errors on CREATE when the destination already exists.
on_write_start now checks for an existing dataset and raises on CREATE, directing users to OVERWRITE/APPEND. APPEND likewise raises a clear error when the dataset is missing. The check is skipped for namespace-backed writes, which declare/create the table location up front.
Closes #64363