Skip to content

fix(file_ops): 基于压缩后估计值分卷,避免卷远小于 24.5 MB 上限#252

Open
ocsin1 wants to merge 2 commits into
MistEO:mainfrom
ocsin1:fix/file_ops
Open

fix(file_ops): 基于压缩后估计值分卷,避免卷远小于 24.5 MB 上限#252
ocsin1 wants to merge 2 commits into
MistEO:mainfrom
ocsin1:fix/file_ops

Conversation

@ocsin1

@ocsin1 ocsin1 commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Summary by Sourcery

调整日志导出卷拆分逻辑,基于压缩后的数据而不是原始文件大小来估算体积,从而减少填充不足的卷,同时仍然遵守 24.5 MB 的限制。

Bug Fixes:

  • 通过基于压缩后大小做拆分决策,防止在日志导出过程中创建明显小于配置上限 24.5 MB 的卷。

Enhancements:

  • 添加启发式和基于预压缩的单文件压缩大小估算,使 zip 卷大小计算更精确,同时不会超过最大卷大小限制。
Original summary in English

Summary by Sourcery

Adjust log export volume splitting to estimate sizes based on compressed data rather than raw file size, reducing underfilled volumes while still respecting the 24.5 MB limit.

Bug Fixes:

  • Prevent creating volumes that are significantly smaller than the configured 24.5 MB upper limit during log export by basing splitting decisions on compressed size.

Enhancements:

  • Add heuristic and pre-compression based estimation of per-file compressed size to make zip volume sizing more accurate without exceeding the maximum volume size.

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - 我发现了 3 个问题,并给出了一些总体反馈:

  • pre_compress_measure 的实现会把整个文件读入内存中的 Vec<u8>,对于较大的边界文件来说可能会有问题;可以考虑改为流式写入一个只负责计数字节数的 Write 实现(或者直接丢弃数据的 sink),从而避免在内存中保存压缩结果。
  • estimate_compressed_upper_bound 中,每次调用都把扩展名转换为小写 String 会在热点路径上增加分配开销;你可以通过直接在原始的 OsStr 上做不区分大小写的比较(例如使用 eq_ignore_ascii_case)来避免这一点,从而降低在计算卷大小时每个条目的开销。
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `pre_compress_measure` implementation reads the entire file into a `Vec<u8>` in memory, which can be problematic for large boundary files; consider streaming into a `Write` implementation that just counts bytes (or a sink that discards data) to avoid holding the compressed result in memory.
- In `estimate_compressed_upper_bound`, converting the extension to a lowercase `String` on every call adds allocation overhead in a hot path; you could avoid this by matching on the raw `OsStr` with case-insensitive comparison (e.g., `eq_ignore_ascii_case`) to reduce per-entry cost during volume calculation.

## Individual Comments

### Comment 1
<location path="src-tauri/src/commands/file_ops.rs" line_range="101-110" />
<code_context>
-    // 文件名在 local header 和中央目录条目里都出现一次,所以 ×2。
-    Some(file_size + ZIP_PER_ENTRY_HEADER_UPPER_BOUND + name_len.saturating_mul(2))
+/// 按文件类型估算压缩后的保守上界。图片不压缩,文本除以 4(实测 8-25x)。
+fn estimate_compressed_upper_bound(path: &Path, file_size: u64) -> u64 {
+    let ext = path
+        .extension()
+        .and_then(|e| e.to_str())
+        .unwrap_or("")
+        .to_lowercase();
+    match ext.as_str() {
+        // 已压缩图片:DEFLATE 无法进一步压缩
+        "png" | "jpg" | "jpeg" => file_size,
+        // 文本文件:保守假设 4x 压缩(实测 log 10-25x,json 5-15x)
+        "log" | "json" | "txt" | "toml" | "yaml" | "yml" | "xml" | "csv" => {
+            file_size.saturating_div(4)
+        }
+        // 未知类型保守假设 2x
+        _ => file_size.saturating_div(2),
+    }
</code_context>
<issue_to_address>
**issue (bug_risk):** Compressed-size estimation may underestimate for already-compressed or incompressible files.

Because this value is used as a capacity check against `MAX_VOLUME_BYTES`, underestimating here means volumes can still exceed the limit slightly, especially for PNG/JPEG and other poorly-compressible/binary formats where DEFLATE can grow the data. It would be safer to bias this estimate upward (e.g., add a small constant overhead or ensure `max(file_size, file_size.saturating_div(2))`) so the bound is truly conservative for incompressible types.
</issue_to_address>

### Comment 2
<location path="src-tauri/src/commands/file_ops.rs" line_range="631-638" />
<code_context>
+                .unwrap_or(0);
+
+            // 用保守上界估算:如果这都放得下,直接写。
+            let est_delta = estimate_compressed_upper_bound(&entry.source_path, file_size);
+            let current_total = counter
+                .load(Ordering::Relaxed)
+                .saturating_add(central_dir_reserve);
+            if wrote_any
+                && current_total
+                    .saturating_add(est_delta)
+                    .saturating_add(entry_cd_bytes)
+                    > MAX_VOLUME_BYTES
+            {
</code_context>
<issue_to_address>
**issue (bug_risk):** Local header + DEFLATE overhead isn't included in the size projection, risking slight volume overruns.

In the previous implementation, `estimate_entry_upper_bound` included `ZIP_PER_ENTRY_HEADER_UPPER_BOUND`, but here `est_delta`/`exact_delta` cover only the compressed payload, and `current_total` only adds central dir + EOCD reserves. As a result, local headers, data descriptors, and DEFLATE framing are excluded from the projection, so the archive can exceed `MAX_VOLUME_BYTES` in boundary cases. Please reintroduce an approximate per-entry overhead term (either into `est_delta`/`exact_delta` or directly into the projection) so the limit reflects the full on-disk size.
</issue_to_address>

### Comment 3
<location path="src-tauri/src/commands/file_ops.rs" line_range="623-628" />
<code_context>
-                break;
+            let entry_cd_bytes =
+                ZIP_CENTRAL_DIR_FIXED_BYTES + entry.archive_name.len() as u64;
+            let file_size = entry
+                .source_path
+                .metadata()
+                .ok()
+                .map(|m| m.len())
+                .unwrap_or(0);
+
+            // 用保守上界估算:如果这都放得下,直接写。
</code_context>
<issue_to_address>
**issue (bug_risk):** Treating metadata errors as size 0 can under-estimate the projected volume size.

Using `unwrap_or(0)` here means IO failures are treated as zero-sized files, which biases the volume-size estimate downward and can let oversized entries slip through the pre-compression check. Instead, consider either using a conservative upper bound (e.g., forcing a new volume or a large sentinel size) or handling the IO error like `pre_compress_measure` failures so we don’t silently under-estimate volume size.
</issue_to_address>

Sourcery 对开源项目是免费的——如果你喜欢我们的代码审查,请考虑分享它 ✨
帮我变得更有用!请在每条评论上点 👍 或 👎,我会根据这些反馈改进以后的代码审查。
Original comment in English

Hey - I've found 3 issues, and left some high level feedback:

  • The pre_compress_measure implementation reads the entire file into a Vec<u8> in memory, which can be problematic for large boundary files; consider streaming into a Write implementation that just counts bytes (or a sink that discards data) to avoid holding the compressed result in memory.
  • In estimate_compressed_upper_bound, converting the extension to a lowercase String on every call adds allocation overhead in a hot path; you could avoid this by matching on the raw OsStr with case-insensitive comparison (e.g., eq_ignore_ascii_case) to reduce per-entry cost during volume calculation.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `pre_compress_measure` implementation reads the entire file into a `Vec<u8>` in memory, which can be problematic for large boundary files; consider streaming into a `Write` implementation that just counts bytes (or a sink that discards data) to avoid holding the compressed result in memory.
- In `estimate_compressed_upper_bound`, converting the extension to a lowercase `String` on every call adds allocation overhead in a hot path; you could avoid this by matching on the raw `OsStr` with case-insensitive comparison (e.g., `eq_ignore_ascii_case`) to reduce per-entry cost during volume calculation.

## Individual Comments

### Comment 1
<location path="src-tauri/src/commands/file_ops.rs" line_range="101-110" />
<code_context>
-    // 文件名在 local header 和中央目录条目里都出现一次,所以 ×2。
-    Some(file_size + ZIP_PER_ENTRY_HEADER_UPPER_BOUND + name_len.saturating_mul(2))
+/// 按文件类型估算压缩后的保守上界。图片不压缩,文本除以 4(实测 8-25x)。
+fn estimate_compressed_upper_bound(path: &Path, file_size: u64) -> u64 {
+    let ext = path
+        .extension()
+        .and_then(|e| e.to_str())
+        .unwrap_or("")
+        .to_lowercase();
+    match ext.as_str() {
+        // 已压缩图片:DEFLATE 无法进一步压缩
+        "png" | "jpg" | "jpeg" => file_size,
+        // 文本文件:保守假设 4x 压缩(实测 log 10-25x,json 5-15x)
+        "log" | "json" | "txt" | "toml" | "yaml" | "yml" | "xml" | "csv" => {
+            file_size.saturating_div(4)
+        }
+        // 未知类型保守假设 2x
+        _ => file_size.saturating_div(2),
+    }
</code_context>
<issue_to_address>
**issue (bug_risk):** Compressed-size estimation may underestimate for already-compressed or incompressible files.

Because this value is used as a capacity check against `MAX_VOLUME_BYTES`, underestimating here means volumes can still exceed the limit slightly, especially for PNG/JPEG and other poorly-compressible/binary formats where DEFLATE can grow the data. It would be safer to bias this estimate upward (e.g., add a small constant overhead or ensure `max(file_size, file_size.saturating_div(2))`) so the bound is truly conservative for incompressible types.
</issue_to_address>

### Comment 2
<location path="src-tauri/src/commands/file_ops.rs" line_range="631-638" />
<code_context>
+                .unwrap_or(0);
+
+            // 用保守上界估算:如果这都放得下,直接写。
+            let est_delta = estimate_compressed_upper_bound(&entry.source_path, file_size);
+            let current_total = counter
+                .load(Ordering::Relaxed)
+                .saturating_add(central_dir_reserve);
+            if wrote_any
+                && current_total
+                    .saturating_add(est_delta)
+                    .saturating_add(entry_cd_bytes)
+                    > MAX_VOLUME_BYTES
+            {
</code_context>
<issue_to_address>
**issue (bug_risk):** Local header + DEFLATE overhead isn't included in the size projection, risking slight volume overruns.

In the previous implementation, `estimate_entry_upper_bound` included `ZIP_PER_ENTRY_HEADER_UPPER_BOUND`, but here `est_delta`/`exact_delta` cover only the compressed payload, and `current_total` only adds central dir + EOCD reserves. As a result, local headers, data descriptors, and DEFLATE framing are excluded from the projection, so the archive can exceed `MAX_VOLUME_BYTES` in boundary cases. Please reintroduce an approximate per-entry overhead term (either into `est_delta`/`exact_delta` or directly into the projection) so the limit reflects the full on-disk size.
</issue_to_address>

### Comment 3
<location path="src-tauri/src/commands/file_ops.rs" line_range="623-628" />
<code_context>
-                break;
+            let entry_cd_bytes =
+                ZIP_CENTRAL_DIR_FIXED_BYTES + entry.archive_name.len() as u64;
+            let file_size = entry
+                .source_path
+                .metadata()
+                .ok()
+                .map(|m| m.len())
+                .unwrap_or(0);
+
+            // 用保守上界估算:如果这都放得下,直接写。
</code_context>
<issue_to_address>
**issue (bug_risk):** Treating metadata errors as size 0 can under-estimate the projected volume size.

Using `unwrap_or(0)` here means IO failures are treated as zero-sized files, which biases the volume-size estimate downward and can let oversized entries slip through the pre-compression check. Instead, consider either using a conservative upper bound (e.g., forcing a new volume or a large sentinel size) or handling the IO error like `pre_compress_measure` failures so we don’t silently under-estimate volume size.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread src-tauri/src/commands/file_ops.rs Outdated
Comment thread src-tauri/src/commands/file_ops.rs Outdated
Comment thread src-tauri/src/commands/file_ops.rs Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant