Skip to content

fix: Faiss read/write failure on Windows with non-ASCII user paths#8185

Open
irmia2026 wants to merge 1 commit into
AstrBotDevs:masterfrom
irmia2026:fix/faiss-unicode-path
Open

fix: Faiss read/write failure on Windows with non-ASCII user paths#8185
irmia2026 wants to merge 1 commit into
AstrBotDevs:masterfrom
irmia2026:fix/faiss-unicode-path

Conversation

@irmia2026
Copy link
Copy Markdown

@irmia2026 irmia2026 commented May 14, 2026

Summary

Fix Faiss index read/write failure on Windows when the path contains non-ASCII characters (Chinese, Japanese, Korean, Cyrillic, etc.).

Root Cause

Faiss C++ (io.cpp) uses fopen(const char*) which expects ANSI codepage on Windows, but Python passes paths as UTF-8 bytes through the C extension bridge. Multi-byte characters get garbled → No such file or directory.

Impact

All Windows users with non-ASCII usernames are unable to use the knowledge base feature. Upload fails with:

存储失败:文本块已生成,但写入知识库索引时出错

While chunks are written to doc.db (SQLite via Python), the Faiss vector index and kb_documents metadata are both lost.

Fix

Temp-file bridge: Faiss reads/writes a pure-ASCII temp file; Python shutil (using Windows wide-char APIs natively) handles the real path.

Additional safeguards:

  • _safe_temp_dir() — priority cascade ensuring ASCII temp directory even when TEMP env points to a non-ASCII path (falls back to %SystemRoot%\Temp)
  • PID-based temp filenames for concurrency safety
  • os.makedirs before shutil.move for first-time writes in nested directories

Files Changed

One file: app/astrbot/core/db/vec_db/faiss_impl/embedding_storage.py

  • New: _safe_temp_dir(), _read_index(), _write_index()
  • Modified: __init__(), save_index()
  • No public API changes. Fully backward-compatible — on Linux/macOS/ASCII-Windows, the try: direct read succeeds and the fallback is never triggered.

Verification

7 tests passed on Windows with Chinese username path: write/read round-trip, concurrent access, first-time write, nested directories, polluted TEMP fallback.

Summary by Sourcery

Add a Faiss embedding storage wrapper that safely reads and writes indices, including on Windows paths with non-ASCII characters.

Bug Fixes:

  • Resolve Faiss index read/write failures on Windows when the target path contains non-ASCII characters by routing I/O through ASCII-only temporary files.

Enhancements:

  • Introduce a robust EmbeddingStorage abstraction around Faiss indices with async insert, batch insert, search, delete, and persistence operations.
  • Add a safe temporary-directory selection strategy to ensure ASCII-only, writable temp locations are used for Faiss I/O bridging on Windows.

@auto-assign auto-assign Bot requested review from LIghtJUNction and Raven95676 May 14, 2026 02:41
@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 14, 2026
Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 4 issues, and left some high level feedback:

  • In _write_index, os.path.dirname(path) will return an empty string for bare filenames, causing os.makedirs('') to fail; consider guarding with a check and only creating the directory when it is non-empty.
  • Both _read_index and _write_index use PID-based temp filenames which can collide on concurrent calls within the same process; using tempfile.mkstemp or including a random/UUID suffix would make these temp paths safer.
  • _safe_temp_dir currently hardcodes a C:\Windows\Temp fallback even on non-Windows platforms; adding an os.name == 'nt' branch (and a POSIX-appropriate fallback) would avoid creating Windows-style paths on Unix-like systems.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `_write_index`, `os.path.dirname(path)` will return an empty string for bare filenames, causing `os.makedirs('')` to fail; consider guarding with a check and only creating the directory when it is non-empty.
- Both `_read_index` and `_write_index` use PID-based temp filenames which can collide on concurrent calls within the same process; using `tempfile.mkstemp` or including a random/UUID suffix would make these temp paths safer.
- `_safe_temp_dir` currently hardcodes a `C:\Windows\Temp` fallback even on non-Windows platforms; adding an `os.name == 'nt'` branch (and a POSIX-appropriate fallback) would avoid creating Windows-style paths on Unix-like systems.

## Individual Comments

### Comment 1
<location path="app/astrbot/core/db/vec_db/faiss_impl/embedding_storage.py" line_range="173-177" />
<code_context>
+        self.index.remove_ids(id_array)
+        await self.save_index()
+
+    async def save_index(self) -> None:
+        """保存索引(兼容含非 ASCII 字符的 Windows 路径)"""
+        if self.index is None:
+            return
+        self._write_index(self.index, self.path)
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against `self.path` being `None` before calling `_write_index`.

Since the constructor allows `path: str | None = None`, `save_index` can be called when only the in‑memory index is used and `self.path` is still `None`. In that case `_write_index` receives `None` and `os.path.dirname(path)` raises a `TypeError`. Please early‑return when `self.path` is falsy, e.g.:

```python
async def save_index(self) -> None:
    if self.index is None or not self.path:
        return
    self._write_index(self.index, self.path)
```
</issue_to_address>

### Comment 2
<location path="app/astrbot/core/db/vec_db/faiss_impl/embedding_storage.py" line_range="97" />
<code_context>
+
+        写入前先确保目标目录存在,防止 shutil.move 时目录缺失。
+        """
+        os.makedirs(os.path.dirname(path), exist_ok=True)
+
+        tmp = os.path.join(_safe_temp_dir(), f"_faiss_write_{os.getpid()}.faiss")
</code_context>
<issue_to_address>
**issue:** Handle the case where `path` has no directory component to avoid `os.makedirs('')`.

`os.path.dirname(path)` is `''` when `path` is just a filename (e.g. `'index.faiss'`), and `os.makedirs('')` raises `FileNotFoundError`. Add a guard so you only create the directory when it exists:

```python
dirname = os.path.dirname(path)
if dirname:
    os.makedirs(dirname, exist_ok=True)
```
</issue_to_address>

### Comment 3
<location path="app/astrbot/core/db/vec_db/faiss_impl/embedding_storage.py" line_range="77-86" />
<code_context>
+        except RuntimeError:
+            pass  # 不吞其他异常类型
+
+        tmp = os.path.join(_safe_temp_dir(), f"_faiss_read_{os.getpid()}.faiss")
+        try:
+            shutil.copy2(path, tmp)
+            return faiss.read_index(tmp)
+        finally:
+            try:
+                os.remove(tmp)
+            except OSError:
+                pass
+
+    @staticmethod
+    def _write_index(index: "faiss.Index", path: str) -> None:
+        """保存 Faiss 索引,兼容含非 ASCII 字符的 Windows 路径。
+
+        先写入纯 ASCII 临时文件,再用 Python shutil.move 移动到位。
+        Python 文件操作使用 Windows wide-char API (CreateFileW),
+        正确支持 Unicode 路径。
+
+        写入前先确保目标目录存在,防止 shutil.move 时目录缺失。
+        """
+        os.makedirs(os.path.dirname(path), exist_ok=True)
+
+        tmp = os.path.join(_safe_temp_dir(), f"_faiss_write_{os.getpid()}.faiss")
+        try:
+            faiss.write_index(index, tmp)
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Use per-call unique temp filenames to avoid collisions in concurrent usage.

Both `_read_index` and `_write_index` derive the temp filename only from `os.getpid()`, so concurrent calls in the same process (threads/async tasks) will race on the same file and can corrupt the index. Please generate a unique temporary file per call (e.g. `tempfile.NamedTemporaryFile(delete=False)`, `tempfile.mkstemp()`, or by adding a random/UUID suffix).

Suggested implementation:

```python
import os
import shutil
import tempfile

```

```python
        fd, tmp = tempfile.mkstemp(prefix="_faiss_read_", suffix=".faiss", dir=_safe_temp_dir())
        os.close(fd)
        try:
            shutil.copy2(path, tmp)
            return faiss.read_index(tmp)
        finally:
            try:
                os.remove(tmp)
            except OSError:
                pass

```

```python
        fd, tmp = tempfile.mkstemp(prefix="_faiss_write_", suffix=".faiss", dir=_safe_temp_dir())
        os.close(fd)
        try:
            faiss.write_index(index, tmp)
            # Windows 同盘 move 是原子 rename,跨盘则走 copy+delete
            shutil.move(tmp, path)
        finally:
            try:
                os.remove(tmp)
            except OSError:
                pass

```

If `import os` / `import shutil` are structured differently (e.g. split across multiple lines or grouped with other imports), adjust the first SEARCH/REPLACE block to match the actual import section and just add `import tempfile` alongside the existing imports.
</issue_to_address>

### Comment 4
<location path="app/astrbot/core/db/vec_db/faiss_impl/embedding_storage.py" line_range="139-144" />
<code_context>
+            raise ValueError(
+                f"向量维度不匹配, 期望: {self.dimension}, 实际: {vectors.shape[1]}",
+            )
+        self.index.add_with_ids(vectors, np.array(ids))
+        await self.save_index()
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Specify an explicit integer dtype for the ID array passed to Faiss.

Faiss requires ID arrays to be 64‑bit integers. To avoid platform‑dependent dtypes, cast explicitly:

```python
self.index.add_with_ids(vectors, np.array(ids, dtype=np.int64))
```

```suggestion
        if vectors.shape[1] != self.dimension:
            raise ValueError(
                f"向量维度不匹配, 期望: {self.dimension}, 实际: {vectors.shape[1]}",
            )
+        self.index.add_with_ids(vectors, np.array(ids, dtype=np.int64))
+        await self.save_index()
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread app/astrbot/core/db/vec_db/faiss_impl/embedding_storage.py Outdated
Comment thread app/astrbot/core/db/vec_db/faiss_impl/embedding_storage.py Outdated
Comment thread app/astrbot/core/db/vec_db/faiss_impl/embedding_storage.py Outdated
Comment thread app/astrbot/core/db/vec_db/faiss_impl/embedding_storage.py Outdated
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a Faiss-based vector storage implementation with a specific workaround for Windows path encoding issues using an ASCII temporary directory bridge. The reviewer identified several critical improvements: ensuring consistent vector normalization between insertion and search, handling input dimensions in the search method, and addressing logic flaws in the temporary directory fallback. Additionally, feedback was provided regarding thread safety in temporary file creation, the need for unit tests, and the importance of offloading blocking I/O operations in asynchronous methods to avoid event loop stalls.


"""
assert self.index is not None, "FAISS index is not initialized."
faiss.normalize_L2(vector)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

search 方法中对查询向量进行了 L2 归一化,但在 insertinsert_batch 中并未对存入的向量进行归一化。这会导致搜索时的距离计算逻辑不一致(将归一化向量与未归一化向量进行比较)。如果目的是计算欧氏距离,请移除此行;如果目的是计算余弦相似度,请确保在插入时也进行归一化,并考虑使用 IndexFlatIP

"""
assert self.index is not None, "FAISS index is not initialized."
faiss.normalize_L2(vector)
distances, indices = self.index.search(vector, k)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Faiss 的 index.search 方法期望输入是一个二维数组(即一批查询向量)。如果传入的 vector 是一维数组,此处会抛出异常。建议使用 reshape(1, -1) 确保输入维度正确。

Suggested change
distances, indices = self.index.search(vector, k)
distances, indices = self.index.search(vector.reshape(1, -1), k)

@@ -0,0 +1,177 @@
try:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

新实现的 Faiss 向量存储功能需要配套相应的单元测试,以确保功能的正确性和稳定性。

References
  1. New functionality, such as handling attachments or new storage implementations, should be accompanied by corresponding unit tests.

Comment on lines +43 to +50
for d in candidates:
if d.isascii() and os.path.isdir(d) and os.access(d, os.W_OK):
return d

# 终极兜底:创建新目录
fallback = r"C:\Windows\Temp"
os.makedirs(fallback, exist_ok=True)
return fallback
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

_safe_temp_dir 的 fallback 逻辑存在潜在问题。如果 C:\Windows\Temp 在循环中因为权限不足(os.access(d, os.W_OK) 为 False)被跳过,函数最后依然会返回该路径并尝试 os.makedirs。如果该路径确实不可写,后续的 Faiss 写入操作仍会失败。建议在找不到合适的 ASCII 且可写的目录时抛出异常,或者确保 fallback 逻辑能有效处理权限问题。

except RuntimeError:
pass # 不吞其他异常类型

tmp = os.path.join(_safe_temp_dir(), f"_faiss_read_{os.getpid()}.faiss")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

使用 os.getpid() 构造临时文件名在多线程环境下不安全。建议使用 tempfile.mkstemp 生成唯一的临时文件名。另外,注意到 _read_index_write_index 中都有类似的临时文件创建逻辑,建议根据 Rule 2 将其重构为共享的辅助函数以减少代码重复。

        fd, tmp = tempfile.mkstemp(dir=_safe_temp_dir(), prefix="_faiss_read_", suffix=".faiss")
        os.close(fd)
References
  1. When implementing similar functionality for different cases, refactor the logic into a shared helper function to avoid code duplication.

Comment on lines +99 to +108
tmp = os.path.join(_safe_temp_dir(), f"_faiss_write_{os.getpid()}.faiss")
try:
faiss.write_index(index, tmp)
# Windows 同盘 move 是原子 rename,跨盘则走 copy+delete
shutil.move(tmp, path)
finally:
try:
os.remove(tmp)
except OSError:
pass
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

这里的临时文件名同样存在多线程冲突风险。此外,shutil.move 在成功移动文件后通常会删除源文件,导致 finally 块中的 os.remove(tmp) 抛出 FileNotFoundError。建议优化逻辑。另外,此处的临时文件创建与 _read_index 中的逻辑相似,建议根据 Rule 2 将其重构为共享的辅助函数以避免代码重复。

Suggested change
tmp = os.path.join(_safe_temp_dir(), f"_faiss_write_{os.getpid()}.faiss")
try:
faiss.write_index(index, tmp)
# Windows 同盘 move 是原子 rename,跨盘则走 copy+delete
shutil.move(tmp, path)
finally:
try:
os.remove(tmp)
except OSError:
pass
fd, tmp = tempfile.mkstemp(dir=_safe_temp_dir(), prefix="_faiss_write_", suffix=".faiss")
os.close(fd)
try:
faiss.write_index(index, tmp)
shutil.move(tmp, path)
finally:
if os.path.exists(tmp):
try:
os.remove(tmp)
except OSError:
pass
References
  1. When implementing similar functionality for different cases, refactor the logic into a shared helper function to avoid code duplication.

Comment on lines +173 to +177
async def save_index(self) -> None:
"""保存索引(兼容含非 ASCII 字符的 Windows 路径)"""
if self.index is None:
return
self._write_index(self.index, self.path)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

save_index 是一个异步方法,但它调用的 _write_index 包含同步的磁盘 I/O,这会阻塞 asyncio 事件循环。建议使用 loop.run_in_executor 执行。需要注意的是,根据 Rule 1,虽然同步函数在主循环中是原子的,但一旦移动到 executor 执行,就失去了这种原子性,因此必须对 self.index 的并发访问进行保护。

References
  1. In a single-threaded asyncio event loop, synchronous functions are executed atomically and will not be interrupted by other coroutines. Therefore, they are safe from race conditions when modifying shared state within that block.

@irmia2026 irmia2026 force-pushed the fix/faiss-unicode-path branch from c4af721 to f9a13d8 Compare May 14, 2026 02:55
@irmia2026
Copy link
Copy Markdown
Author

All issues raised by gemini-code-assist and sourcery-ai have been addressed in f9a13d8. Ready for human review. 🙏

@Soulter
Copy link
Copy Markdown
Member

Soulter commented May 14, 2026

Your code path is wrong..

embedding_storage.py is at astrbot/core/db/vec_db/faiss_impl/embedding_storage.py

@irmia2026 irmia2026 force-pushed the fix/faiss-unicode-path branch from f9a13d8 to 374dd8d Compare May 14, 2026 03:56
@irmia2026
Copy link
Copy Markdown
Author

Ah, good catch — that was a leftover app/ prefix from how the file tree looked under my local AstrBot installation. I should have verified against the upstream layout before pushing.

Fixed in the latest force-push: the file is now at astrbot/core/db/vec_db/faiss_impl/embedding_storage.py where it belongs. Thanks for the review! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants