fix(file_ops): 基于压缩后估计值分卷,避免卷远小于 24.5 MB 上限#252
Open
ocsin1 wants to merge 2 commits into
Open
Conversation
Contributor
There was a problem hiding this comment.
Hey - 我发现了 3 个问题,并给出了一些总体反馈:
pre_compress_measure的实现会把整个文件读入内存中的Vec<u8>,对于较大的边界文件来说可能会有问题;可以考虑改为流式写入一个只负责计数字节数的Write实现(或者直接丢弃数据的 sink),从而避免在内存中保存压缩结果。- 在
estimate_compressed_upper_bound中,每次调用都把扩展名转换为小写String会在热点路径上增加分配开销;你可以通过直接在原始的OsStr上做不区分大小写的比较(例如使用eq_ignore_ascii_case)来避免这一点,从而降低在计算卷大小时每个条目的开销。
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `pre_compress_measure` implementation reads the entire file into a `Vec<u8>` in memory, which can be problematic for large boundary files; consider streaming into a `Write` implementation that just counts bytes (or a sink that discards data) to avoid holding the compressed result in memory.
- In `estimate_compressed_upper_bound`, converting the extension to a lowercase `String` on every call adds allocation overhead in a hot path; you could avoid this by matching on the raw `OsStr` with case-insensitive comparison (e.g., `eq_ignore_ascii_case`) to reduce per-entry cost during volume calculation.
## Individual Comments
### Comment 1
<location path="src-tauri/src/commands/file_ops.rs" line_range="101-110" />
<code_context>
- // 文件名在 local header 和中央目录条目里都出现一次,所以 ×2。
- Some(file_size + ZIP_PER_ENTRY_HEADER_UPPER_BOUND + name_len.saturating_mul(2))
+/// 按文件类型估算压缩后的保守上界。图片不压缩,文本除以 4(实测 8-25x)。
+fn estimate_compressed_upper_bound(path: &Path, file_size: u64) -> u64 {
+ let ext = path
+ .extension()
+ .and_then(|e| e.to_str())
+ .unwrap_or("")
+ .to_lowercase();
+ match ext.as_str() {
+ // 已压缩图片:DEFLATE 无法进一步压缩
+ "png" | "jpg" | "jpeg" => file_size,
+ // 文本文件:保守假设 4x 压缩(实测 log 10-25x,json 5-15x)
+ "log" | "json" | "txt" | "toml" | "yaml" | "yml" | "xml" | "csv" => {
+ file_size.saturating_div(4)
+ }
+ // 未知类型保守假设 2x
+ _ => file_size.saturating_div(2),
+ }
</code_context>
<issue_to_address>
**issue (bug_risk):** Compressed-size estimation may underestimate for already-compressed or incompressible files.
Because this value is used as a capacity check against `MAX_VOLUME_BYTES`, underestimating here means volumes can still exceed the limit slightly, especially for PNG/JPEG and other poorly-compressible/binary formats where DEFLATE can grow the data. It would be safer to bias this estimate upward (e.g., add a small constant overhead or ensure `max(file_size, file_size.saturating_div(2))`) so the bound is truly conservative for incompressible types.
</issue_to_address>
### Comment 2
<location path="src-tauri/src/commands/file_ops.rs" line_range="631-638" />
<code_context>
+ .unwrap_or(0);
+
+ // 用保守上界估算:如果这都放得下,直接写。
+ let est_delta = estimate_compressed_upper_bound(&entry.source_path, file_size);
+ let current_total = counter
+ .load(Ordering::Relaxed)
+ .saturating_add(central_dir_reserve);
+ if wrote_any
+ && current_total
+ .saturating_add(est_delta)
+ .saturating_add(entry_cd_bytes)
+ > MAX_VOLUME_BYTES
+ {
</code_context>
<issue_to_address>
**issue (bug_risk):** Local header + DEFLATE overhead isn't included in the size projection, risking slight volume overruns.
In the previous implementation, `estimate_entry_upper_bound` included `ZIP_PER_ENTRY_HEADER_UPPER_BOUND`, but here `est_delta`/`exact_delta` cover only the compressed payload, and `current_total` only adds central dir + EOCD reserves. As a result, local headers, data descriptors, and DEFLATE framing are excluded from the projection, so the archive can exceed `MAX_VOLUME_BYTES` in boundary cases. Please reintroduce an approximate per-entry overhead term (either into `est_delta`/`exact_delta` or directly into the projection) so the limit reflects the full on-disk size.
</issue_to_address>
### Comment 3
<location path="src-tauri/src/commands/file_ops.rs" line_range="623-628" />
<code_context>
- break;
+ let entry_cd_bytes =
+ ZIP_CENTRAL_DIR_FIXED_BYTES + entry.archive_name.len() as u64;
+ let file_size = entry
+ .source_path
+ .metadata()
+ .ok()
+ .map(|m| m.len())
+ .unwrap_or(0);
+
+ // 用保守上界估算:如果这都放得下,直接写。
</code_context>
<issue_to_address>
**issue (bug_risk):** Treating metadata errors as size 0 can under-estimate the projected volume size.
Using `unwrap_or(0)` here means IO failures are treated as zero-sized files, which biases the volume-size estimate downward and can let oversized entries slip through the pre-compression check. Instead, consider either using a conservative upper bound (e.g., forcing a new volume or a large sentinel size) or handling the IO error like `pre_compress_measure` failures so we don’t silently under-estimate volume size.
</issue_to_address>帮我变得更有用!请在每条评论上点 👍 或 👎,我会根据这些反馈改进以后的代码审查。
Original comment in English
Hey - I've found 3 issues, and left some high level feedback:
- The
pre_compress_measureimplementation reads the entire file into aVec<u8>in memory, which can be problematic for large boundary files; consider streaming into aWriteimplementation that just counts bytes (or a sink that discards data) to avoid holding the compressed result in memory. - In
estimate_compressed_upper_bound, converting the extension to a lowercaseStringon every call adds allocation overhead in a hot path; you could avoid this by matching on the rawOsStrwith case-insensitive comparison (e.g.,eq_ignore_ascii_case) to reduce per-entry cost during volume calculation.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `pre_compress_measure` implementation reads the entire file into a `Vec<u8>` in memory, which can be problematic for large boundary files; consider streaming into a `Write` implementation that just counts bytes (or a sink that discards data) to avoid holding the compressed result in memory.
- In `estimate_compressed_upper_bound`, converting the extension to a lowercase `String` on every call adds allocation overhead in a hot path; you could avoid this by matching on the raw `OsStr` with case-insensitive comparison (e.g., `eq_ignore_ascii_case`) to reduce per-entry cost during volume calculation.
## Individual Comments
### Comment 1
<location path="src-tauri/src/commands/file_ops.rs" line_range="101-110" />
<code_context>
- // 文件名在 local header 和中央目录条目里都出现一次,所以 ×2。
- Some(file_size + ZIP_PER_ENTRY_HEADER_UPPER_BOUND + name_len.saturating_mul(2))
+/// 按文件类型估算压缩后的保守上界。图片不压缩,文本除以 4(实测 8-25x)。
+fn estimate_compressed_upper_bound(path: &Path, file_size: u64) -> u64 {
+ let ext = path
+ .extension()
+ .and_then(|e| e.to_str())
+ .unwrap_or("")
+ .to_lowercase();
+ match ext.as_str() {
+ // 已压缩图片:DEFLATE 无法进一步压缩
+ "png" | "jpg" | "jpeg" => file_size,
+ // 文本文件:保守假设 4x 压缩(实测 log 10-25x,json 5-15x)
+ "log" | "json" | "txt" | "toml" | "yaml" | "yml" | "xml" | "csv" => {
+ file_size.saturating_div(4)
+ }
+ // 未知类型保守假设 2x
+ _ => file_size.saturating_div(2),
+ }
</code_context>
<issue_to_address>
**issue (bug_risk):** Compressed-size estimation may underestimate for already-compressed or incompressible files.
Because this value is used as a capacity check against `MAX_VOLUME_BYTES`, underestimating here means volumes can still exceed the limit slightly, especially for PNG/JPEG and other poorly-compressible/binary formats where DEFLATE can grow the data. It would be safer to bias this estimate upward (e.g., add a small constant overhead or ensure `max(file_size, file_size.saturating_div(2))`) so the bound is truly conservative for incompressible types.
</issue_to_address>
### Comment 2
<location path="src-tauri/src/commands/file_ops.rs" line_range="631-638" />
<code_context>
+ .unwrap_or(0);
+
+ // 用保守上界估算:如果这都放得下,直接写。
+ let est_delta = estimate_compressed_upper_bound(&entry.source_path, file_size);
+ let current_total = counter
+ .load(Ordering::Relaxed)
+ .saturating_add(central_dir_reserve);
+ if wrote_any
+ && current_total
+ .saturating_add(est_delta)
+ .saturating_add(entry_cd_bytes)
+ > MAX_VOLUME_BYTES
+ {
</code_context>
<issue_to_address>
**issue (bug_risk):** Local header + DEFLATE overhead isn't included in the size projection, risking slight volume overruns.
In the previous implementation, `estimate_entry_upper_bound` included `ZIP_PER_ENTRY_HEADER_UPPER_BOUND`, but here `est_delta`/`exact_delta` cover only the compressed payload, and `current_total` only adds central dir + EOCD reserves. As a result, local headers, data descriptors, and DEFLATE framing are excluded from the projection, so the archive can exceed `MAX_VOLUME_BYTES` in boundary cases. Please reintroduce an approximate per-entry overhead term (either into `est_delta`/`exact_delta` or directly into the projection) so the limit reflects the full on-disk size.
</issue_to_address>
### Comment 3
<location path="src-tauri/src/commands/file_ops.rs" line_range="623-628" />
<code_context>
- break;
+ let entry_cd_bytes =
+ ZIP_CENTRAL_DIR_FIXED_BYTES + entry.archive_name.len() as u64;
+ let file_size = entry
+ .source_path
+ .metadata()
+ .ok()
+ .map(|m| m.len())
+ .unwrap_or(0);
+
+ // 用保守上界估算:如果这都放得下,直接写。
</code_context>
<issue_to_address>
**issue (bug_risk):** Treating metadata errors as size 0 can under-estimate the projected volume size.
Using `unwrap_or(0)` here means IO failures are treated as zero-sized files, which biases the volume-size estimate downward and can let oversized entries slip through the pre-compression check. Instead, consider either using a conservative upper bound (e.g., forcing a new volume or a large sentinel size) or handling the IO error like `pre_compress_measure` failures so we don’t silently under-estimate volume size.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary by Sourcery
调整日志导出卷拆分逻辑,基于压缩后的数据而不是原始文件大小来估算体积,从而减少填充不足的卷,同时仍然遵守 24.5 MB 的限制。
Bug Fixes:
Enhancements:
Original summary in English
Summary by Sourcery
Adjust log export volume splitting to estimate sizes based on compressed data rather than raw file size, reducing underfilled volumes while still respecting the 24.5 MB limit.
Bug Fixes:
Enhancements: