
[data] write_lance(mode=CREATE) errors instead of silently overwriting#64364

Open
tanmayrauth wants to merge 2 commits into
ray-project:masterfrom
tanmayrauth:fix/lance-create-mode-errors-if-exists

Conversation

@tanmayrauth


SaveMode.CREATE is documented to "create new data and error if data already exists", but the Lance sink mapped both CREATE and OVERWRITE to lance.LanceOperation.Overwrite without any existence check, silently clobbering an existing dataset. This diverges from FileDatasink, which errors on CREATE when the destination already exists.

on_write_start now checks for an existing dataset and raises on CREATE, directing users to OVERWRITE/APPEND. APPEND likewise raises a clear error when the dataset is missing. The check is skipped for namespace-backed writes, which declare/create the table location up front.

Closes #64363
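
As a rough illustration of the mode handling described above, the sketch below maps a save mode and an existence flag to a write operation. It does not use Ray or Lance: `SaveMode` here is a local stand-in enum, and `plan_write`, `dataset_exists`, and `namespace_backed` are hypothetical names chosen for this example. The `ValueError` type is also an assumption, not necessarily what the sink raises.

```python
from enum import Enum


class SaveMode(Enum):
    """Local stand-in for Ray Data's SaveMode; the values are illustrative."""

    CREATE = "create"
    APPEND = "append"
    OVERWRITE = "overwrite"


def plan_write(
    mode: SaveMode, dataset_exists: bool, namespace_backed: bool = False
) -> str:
    """Return the operation to perform, or raise if the mode's contract is violated."""
    # Assumption: namespace-backed writes skip the existence check entirely.
    if not namespace_backed:
        if mode is SaveMode.CREATE and dataset_exists:
            raise ValueError(
                "Dataset already exists; use OVERWRITE or APPEND instead of CREATE."
            )
        if mode is SaveMode.APPEND and not dataset_exists:
            raise ValueError("Cannot APPEND: dataset does not exist.")
    # CREATE and OVERWRITE both end up as an overwrite-style commit.
    return "append" if mode is SaveMode.APPEND else "overwrite"
```

With this shape, CREATE and OVERWRITE can still share one commit operation; the only difference is the guard that runs before the write starts.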

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces safety checks in the Lance datasink to prevent overwriting existing datasets when using SaveMode.CREATE and to ensure datasets exist when using SaveMode.APPEND. It also adds corresponding unit tests to verify these behaviors. The review feedback highlights a potential issue in _open_existing_dataset where catching all ValueError exceptions could swallow unrelated configuration errors (such as invalid storage options). It is recommended to inspect the exception message to ensure it specifically indicates a missing dataset before returning None.


Comment on lines +252 to +260
try:
    return lance.LanceDataset(
        self.uri,
        storage_options=self.storage_options,
        storage_options_provider=self.storage_options_provider,
    )
except ValueError:
    # Lance raises ValueError when the dataset does not exist.
    return None

Contributor


high

Catching all ValueError exceptions and unconditionally returning None can swallow unrelated errors, such as invalid storage_options (e.g., invalid keys, types, or values) or other configuration issues. This can lead to highly misleading error messages (e.g., claiming the dataset does not exist when in fact the storage options are invalid).

To make this more robust, we should:

  1. Also catch FileNotFoundError (or OSError) in case certain environments or future Lance versions raise it for missing local/remote paths.
  2. Inspect the exception message to ensure it is indeed a "not found" or "does not exist" error before returning None, and re-raise the exception otherwise.
Suggested change
-try:
-    return lance.LanceDataset(
-        self.uri,
-        storage_options=self.storage_options,
-        storage_options_provider=self.storage_options_provider,
-    )
-except ValueError:
-    # Lance raises ValueError when the dataset does not exist.
-    return None
+try:
+    return lance.LanceDataset(
+        self.uri,
+        storage_options=self.storage_options,
+        storage_options_provider=self.storage_options_provider,
+    )
+except (ValueError, FileNotFoundError) as e:
+    # Only return None if the error indicates the dataset is missing.
+    # Other ValueErrors (e.g. invalid storage options) should propagate.
+    err_msg = str(e).lower()
+    if "not found" in err_msg or "does not exist" in err_msg or "no such file" in err_msg:
+        return None
+    raise

@tanmayrauth tanmayrauth force-pushed the fix/lance-create-mode-errors-if-exists branch from 7822872 to 7b5039a Compare June 26, 2026 04:29
@ray-gardener ray-gardener Bot added data Ray Data-related issues community-contribution Contributed by the community labels Jun 26, 2026
@tanmayrauth

Author

@abrarsheikh can you please review it?

Comment on lines +264 to +266
msg = str(e).lower()
if "not found" in msg or "does not exist" in msg or "no such file" in msg:
    return None

Contributor


A conditional block based on string matching is too brittle, since these error messages may drift between package versions. Please find a more robust method.

Author


@abrarsheikh Yes, I've dropped the error-string matching entirely. Existence is now determined by whether lance.LanceDataset(...) opens successfully — a version-independent signal that also natively honors storage_options. For CREATE, a successful open means "exists" → raise; otherwise we proceed and let the actual write surface any real error (e.g. bad storage options) with Lance's own message. APPEND just opens directly and lets Lance raise its native not-found error. The tests no longer pin any error message.
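
The approach in this reply can be sketched roughly as follows. To stay runnable without pylance installed, the opener is passed in as a callable; in the PR it would presumably be a call to `lance.LanceDataset(...)`. The names `dataset_opens`, `check_create`, and `open_dataset`, and the `ValueError` raised for CREATE, are assumptions made for illustration and are not the PR's actual code.

```python
from typing import Any, Callable


def dataset_opens(open_dataset: Callable[[], Any]) -> bool:
    """Treat a successful open as "exists" and any failure as "not known to exist"."""
    try:
        open_dataset()
    except Exception:
        # No message inspection here: a genuine problem (for example bad
        # storage options) is left for the subsequent write to report.
        return False
    return True


def check_create(uri: str, open_dataset: Callable[[], Any]) -> None:
    """Raise if CREATE would clobber a dataset that opens successfully."""
    if dataset_opens(open_dataset):
        raise ValueError(
            f"Dataset already exists at {uri!r}; use OVERWRITE or APPEND."
        )
```

Under the same assumptions, APPEND would not use `dataset_opens` at all: it would call the opener directly and let whatever error it raises propagate unchanged.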

SaveMode.CREATE is documented to "create new data and error if data
already exists", but the Lance sink mapped both CREATE and OVERWRITE to
lance.LanceOperation.Overwrite without any existence check, silently
clobbering an existing dataset. This diverges from FileDatasink, which
errors on CREATE when the destination already exists.

on_write_start now checks for an existing dataset and raises on CREATE,
directing users to OVERWRITE/APPEND. APPEND likewise raises a clear error
when the dataset is missing. The check is skipped for namespace-backed
writes, which declare/create the table location up front.

Closes ray-project#64363

Signed-off-by: Tanmay Rauth <t_rauth@apple.com>
@tanmayrauth tanmayrauth force-pushed the fix/lance-create-mode-errors-if-exists branch from 23ab01e to 1985caa Compare June 26, 2026 20:24
@tanmayrauth

Author

@abrarsheikh can you please review it when you have a moment?

bad ``storage_options``, etc.). Opening natively honors
``storage_options``/``storage_options_provider``.
"""
import lance

Contributor


Why is this import inside the function?

Author


lance is the optional pylance dependency, so it's imported lazily inside functions throughout this module rather than at the top — on_write_complete (line 307) and _write_fragment (line 101) do the same. A top-level import would make import lance_datasink itself fail for anyone without pylance installed, which is exactly why on_write_start guards with _check_import(..., package="pylance") first.
Happy to centralize it into one lazy helper if you'd prefer a single import site, but I kept it inline to match the existing pattern.
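
For reference, a single lazy import site could look something like the sketch below. `lazy_import` is a hypothetical helper written for this example; it is not Ray's `_check_import`, and the wording of the error message is an assumption.

```python
import importlib
from types import ModuleType


def lazy_import(module: str, package: str) -> ModuleType:
    """Import `module` on first use, with an install hint if it is missing."""
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        # Re-raise with the pip package name, which may differ from the
        # module name (as with `lance` and `pylance`).
        raise ImportError(
            f"`{module}` is required for this operation. "
            f"Install it with `pip install {package}`."
        ) from exc
```

A caller would then write something like `lance = lazy_import("lance", "pylance")` at the top of each function that needs it, so importing the datasink module itself never requires the optional dependency.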


Labels

community-contribution Contributed by the community data Ray Data-related issues


Development

Successfully merging this pull request may close these issues.

[Data] write_lance(mode=SaveMode.CREATE) silently overwrites an existing Lance dataset

2 participants