Self Checks
Dify version
1.14.0-rc1
Plugin version
0.0.49
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
- The Root Cause: Excluding only IMAGE types, but missing DOCUMENT (PDFs)
In the text concatenation logic within _convert_one_message_to_text (around line 1011):
if isinstance(content, list):
    content = "".join((c.data for c in content if c.type != PromptMessageContentType.IMAGE))
The condition if c.type != PromptMessageContentType.IMAGE was evidently added on purpose: treating image data as text during token calculation would cause errors, so image parts are excluded.
However, multimodal inputs consist of more than just images. When a user uploads a file of type DOCUMENT (such as a PDF), the condition still evaluates to True (DOCUMENT is not IMAGE), so the PDF's c.data is joined into the text.
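A minimal sketch of the difference between the deny-list check quoted above and an allow-list alternative. ContentType and Content are hypothetical stand-ins for Dify's PromptMessageContentType and its content classes (their members and fields are assumed here, not copied from the codebase), and join_text_only is one possible fix, not the project's actual patch.

```python
from dataclasses import dataclass
from enum import Enum


class ContentType(Enum):
    """Stand-in for PromptMessageContentType; the member set is assumed."""
    TEXT = "text"
    IMAGE = "image"
    DOCUMENT = "document"


@dataclass
class Content:
    """Stand-in for one prompt message content part."""
    type: ContentType
    data: str


def join_excluding_images(content):
    # Mirrors the check quoted in this report: only IMAGE parts are skipped.
    return "".join(c.data for c in content if c.type != ContentType.IMAGE)


def join_text_only(content):
    # Allow-list alternative: keep TEXT parts and skip every file-like part.
    return "".join(c.data for c in content if c.type == ContentType.TEXT)


parts = [
    Content(ContentType.TEXT, "Summarize this file."),
    Content(ContentType.IMAGE, "data:image/png;base64,AAAA"),
    Content(ContentType.DOCUMENT, "data:application/pdf;base64,JVBERi0xLjMKJc"),
]

leaky = join_excluding_images(parts)  # still contains the PDF payload
clean = join_text_only(parts)         # contains only the user's text
```

With the allow-list, any content type added later (audio, video) is excluded by default instead of needing its own exclusion.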
- The Impact: Appending massive Base64 strings to the prompt
In Dify's data structures, c.data for file types (including images, PDFs, audio, etc.) typically stores the file's Base64-encoded string (e.g., data:application/pdf;base64,JVBERi0xLjMKJc...).
Base64 encodes every 3 bytes as 4 characters, so even a small 2 MB PDF grows by about 33%, to roughly 2.7 million characters.
Consequently, this 2.7-million-character Base64 string is concatenated into the prompt text. When that text is then fed into the local GPT-2 tokenizer (_get_num_tokens_by_gpt2), the result is a wildly inflated and inaccurate token count (e.g., millions of tokens for a single small PDF).
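The size estimate above can be checked with a short calculation. This is a sketch: base64_length is a hypothetical helper name, and the count excludes the short data:application/pdf;base64, prefix.

```python
import base64


def base64_length(num_bytes: int) -> int:
    # Standard padded Base64 emits 4 characters per 3-byte group, rounded up.
    return 4 * ((num_bytes + 2) // 3)


# Sanity check of the formula against the standard library on a small buffer.
sample = bytes(300_000)
formula_matches = len(base64.b64encode(sample)) == base64_length(len(sample))

chars_2mb = base64_length(2_000_000)         # 2 MB as 10^6-byte megabytes
chars_2mib = base64_length(2 * 1024 * 1024)  # 2 MiB as 2^20-byte mebibytes
```

Either reading of "2MB" lands between about 2.67 and 2.80 million characters, consistent with the figure quoted above.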
✔️ Error log
No response