The docling program uses a thread pool to batch convert multiple docx files, but fails to convert several of them. #3364

@ouyangyifan1231-oss

Bug

When docling converts multiple docx files concurrently, some of the conversions fail. Converting the same files one at a time always succeeds.

Steps to reproduce

Place 28 docx files in a folder and convert them with the following program, with the number of threads set to 4 or more. Then count the markdown files in the output folder: some files have not been converted.
Replacing the thread pool with a process pool gives the same result.

import os
import subprocess
import concurrent.futures
import sys
from pathlib import Path

def convert_file(doc_path, output_dir=None):
    try:
        md_path = str(Path(doc_path).with_suffix('.md'))
        cmd = [
            "docling",
            "--image-export-mode", "referenced",
            "--from", "docx",
            "--to", "md",
            "--output", output_dir or os.path.dirname(md_path),
            doc_path
        ]
        subprocess.run(cmd, check=True, capture_output=True, text=True)
        print(f"✓ {os.path.basename(doc_path)}")
        return True
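    except subprocess.CalledProcessError as e:
        # capture_output=True hides the CLI's stderr, so include it in the report
        print(f"✗ {os.path.basename(doc_path)}: exit {e.returncode}\n{e.stderr.strip()}")
        return False
    except Exception as e:
        print(f"✗ {os.path.basename(doc_path)}: {e}")
        return False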

def main():
    if len(sys.argv) < 2:
        print("Usage: python script.py <folder> [output_dir] [workers]")
        return

    folder = sys.argv[1]
    output_dir = sys.argv[2] if len(sys.argv) > 2 else folder
    workers = int(sys.argv[3]) if len(sys.argv) > 3 else 4

    files = list(Path(folder).rglob("*.docx"))
    if not files:
        print("No docx files found")
        return

    print(f"Found {len(files)} files, starting conversion...")

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(convert_file, str(f), output_dir): f for f in files}
        results = [future.result() for future in concurrent.futures.as_completed(futures)]

    success = sum(results)
    print(f"\nDone! Succeeded: {success}, failed: {len(results)-success}")

if __name__ == "__main__":
    main()
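
The report does not say which files fail or what the CLI prints when they do. A minimal diagnostic sketch is below. It runs the same command in a thread pool, keeps each run's exit code and stderr, and lists the inputs that have no markdown output. The helper names (`run_one`, `convert_all`, `missing_outputs`, `docling_cmd`) are hypothetical, and the `<stem>.md` output naming is an assumption rather than something stated in this issue.

```python
import concurrent.futures
import subprocess
import sys
from pathlib import Path


def run_one(cmd):
    """Run one command and return (returncode, stderr) without raising."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stderr


def convert_all(files, build_cmd, workers=4):
    """Run build_cmd(path) for every file in a thread pool.

    Returns {path: (returncode, stderr)} for the runs that exited non-zero.
    """
    failures = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_one, build_cmd(f)): f for f in files}
        for fut in concurrent.futures.as_completed(futures):
            code, err = fut.result()
            if code != 0:
                failures[futures[fut]] = (code, err)
    return failures


def missing_outputs(files, output_dir):
    """Return inputs with no <stem>.md in output_dir (assumed naming)."""
    out = Path(output_dir)
    return [f for f in files if not (out / (Path(f).stem + ".md")).exists()]


def docling_cmd(output_dir):
    """Build the same CLI call as the script in this issue."""
    def build(path):
        return ["docling", "--image-export-mode", "referenced",
                "--from", "docx", "--to", "md",
                "--output", str(output_dir), str(path)]
    return build


def main():
    if len(sys.argv) < 3:
        print("Usage: python diagnose.py <folder> <output_dir> [workers]")
        return
    folder, output_dir = sys.argv[1], sys.argv[2]
    workers = int(sys.argv[3]) if len(sys.argv) > 3 else 4
    files = sorted(Path(folder).rglob("*.docx"))
    failed = convert_all(files, docling_cmd(output_dir), workers)
    for path, (code, err) in failed.items():
        print(f"{path}: exit {code}\n{err}")
    for path in missing_outputs(files, output_dir):
        print(f"no markdown output for {path}")


if __name__ == "__main__":
    main()
```

If the naming assumption holds, two inputs with the same stem in different subfolders map to the same output file, so a missing-output count can differ from the failure count for that reason alone.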

Docling version

docling-2.90.0

Python version

Python 3.12.12

Labels

bug, docx
