Bug
When docling converts multiple docx files concurrently (several worker threads, each launching the docling CLI), some conversions fail. Converting the same files one at a time always succeeds.
Steps to reproduce
Place 28 docx files in a folder and convert them with the following program, setting the number of threads to 4 or more. Then count the markdown files in the output folder: some files will not have been converted.
Replacing the thread pool with a process pool yields the same result.
import os
import subprocess
import concurrent.futures
import sys
from pathlib import Path


def convert_file(doc_path, output_dir=None):
    """Convert one docx file to markdown by invoking the docling CLI."""
    try:
        md_path = str(Path(doc_path).with_suffix('.md'))
        cmd = [
            "docling",
            "--image-export-mode", "referenced",
            "--from", "docx",
            "--to", "md",
            # Default to the source file's own directory when no output dir is given.
            "--output", output_dir or os.path.dirname(md_path),
            doc_path
        ]
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True, capture_output=True, text=True)
        print(f"✓ {os.path.basename(doc_path)}")
        return True
    except Exception as e:
        print(f"✗ {os.path.basename(doc_path)}: {e}")
        return False


def main():
    if len(sys.argv) < 2:
        print("Usage: python script.py <folder> [output_dir] [threads]")
        return

    folder = sys.argv[1]
    output_dir = sys.argv[2] if len(sys.argv) > 2 else folder
    workers = int(sys.argv[3]) if len(sys.argv) > 3 else 4

    files = list(Path(folder).rglob("*.docx"))
    if not files:
        print("No docx files found")
        return

    print(f"Found {len(files)} files, starting conversion...")

    # Each worker thread launches its own docling subprocess.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(convert_file, str(f), output_dir): f for f in files}
        results = [future.result() for future in concurrent.futures.as_completed(futures)]

    success = sum(results)
    print(f"\nDone! Succeeded: {success}, failed: {len(results)-success}")


if __name__ == "__main__":
    main()
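To see which files are missing instead of only counting the markdown files, a small helper along these lines could compare the inputs with the outputs. This is only a sketch: `find_missing_outputs` is a hypothetical name, and it assumes each `<stem>.docx` ends up as `<stem>.md` directly in the output folder.

```python
from pathlib import Path


def find_missing_outputs(input_dir, output_dir):
    """Return the .docx files under input_dir that have no matching .md in output_dir."""
    # Assumption: an input "<stem>.docx" is written as "<stem>.md" directly in output_dir.
    produced = {p.stem for p in Path(output_dir).glob("*.md")}
    return sorted(
        str(p) for p in Path(input_dir).rglob("*.docx") if p.stem not in produced
    )
```

Calling `find_missing_outputs(folder, output_dir)` after a run with 4 or more threads should list the files whose conversion did not produce output, which can then be retried one at a time for comparison.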
Docling version
docling-2.90.0
Python version
Python 3.12.12