gh-139489: Add xml.is_valid_text() #149412

Open
serhiy-storchaka wants to merge 2 commits into python:main from serhiy-storchaka:xml-is_valid_text-1.0

Conversation

@serhiy-storchaka
Member

serhiy-storchaka commented May 5, 2026

@vstinner
Member

vstinner commented May 5, 2026

What do you think of also adding a sanitize function that takes a callback? Example:

import re

ILLEGAL_XML_CHARS_RE = re.compile(
    '['
    # Control characters; newline (\x0A and \x0D) and TAB (\x09) are legal
    '\x00-\x08\x0B\x0C\x0E-\x1F'
    # Surrogate characters
    '\uD800-\uDFFF'
    # Special Unicode characters
    '\uFFFE'
    '\uFFFF'
    # Match multiple sequential invalid characters for better efficiency
    ']+')

def sanitize(text, replace_func):
    def callback(regs):
        return replace_func(regs[0])

    return ILLEGAL_XML_CHARS_RE.sub(callback, text)

def escape(text):
    return ''.join(f'\\x{ord(char):02x}' for char in text)

invalid = '\x00'
test = f'a{invalid}b'
print(sanitize(test, lambda text: '#' * len(text)))
print(sanitize(test, escape))

It would be useful for sanitize_xml() in Lib/test/libregrtest/utils.py.
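To illustrate why such a helper matters when writing XML reports, here is a minimal sketch that sanitizes a message before storing it in an element and then parses the serialized result back. It reuses the `sanitize()` and `escape()` helpers from the example above; `make_report()` and the `system-out` element name are hypothetical and are not taken from `Lib/test/libregrtest/utils.py`. The character class is written with regex escapes in a raw string so that the source contains no literal control or surrogate characters.

```python
import re
import xml.etree.ElementTree as ET

# Same character class as in the example above, expressed with regex escapes.
ILLEGAL_XML_CHARS_RE = re.compile(
    r'[\x00-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE\uFFFF]+')


def sanitize(text, replace_func):
    # Pass each run of illegal characters to the user-supplied callback.
    return ILLEGAL_XML_CHARS_RE.sub(lambda m: replace_func(m[0]), text)


def escape(text):
    # Render each illegal character as a visible backslash escape.
    return ''.join(f'\\x{ord(char):02x}' for char in text)


def make_report(message):
    # Hypothetical helper: store sanitized text in an element and serialize it.
    elem = ET.Element('system-out')
    elem.text = sanitize(message, escape)
    return ET.tostring(elem, encoding='unicode')


report = make_report('a\x00b')
parsed = ET.fromstring(report)  # parses because the NUL was escaped first
print(parsed.text)
```

The callback keeps the policy (drop, mask, or escape) out of the helper itself, so the same regular expression can serve different callers.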

Comment thread Doc/library/xml.rst Outdated
Co-authored-by: Stan Ulbrych <stan@python.org>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
@serhiy-storchaka
Member Author

What do you think of also adding a sanitize function that takes a callback?

I left it for a separate PR. I have a similar function in a different branch, but it also accepts the name of a registered error handler ('strict', 'ignore', 'replace', 'backslashreplace', etc.) and the callback has a different interface. It needs a separate discussion.
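For readers unfamiliar with registered error handlers, the following is a rough sketch of how a sanitize function might accept a handler name. It is not the implementation from the other branch, which is not shown in this conversation. The function name `sanitize_with_handler`, the pseudo-encoding name `'xml'`, and the reason string are assumptions made for illustration. The sketch relies only on `codecs.lookup_error()` and the standard handler protocol, in which a handler receives a `UnicodeEncodeError` and returns a `(replacement, new_position)` tuple.

```python
import codecs
import re

# Characters outside the XML 1.0 Char production (same class as discussed above).
_ILLEGAL_RE = re.compile(
    r'[\x00-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE\uFFFF]+')


def sanitize_with_handler(text, errors='strict'):
    """Hypothetical sketch: replace illegal runs using a codec error handler."""
    handler = codecs.lookup_error(errors)  # raises LookupError if unknown

    def callback(match):
        # Describe the offending run the way an encoder would report it.
        exc = UnicodeEncodeError('xml', text, match.start(), match.end(),
                                 'character not allowed in XML 1.0')
        # Assumption: the handler returns a str replacement and resumes at
        # match.end(), as the standard handlers used below do.
        replacement, _new_pos = handler(exc)
        return replacement

    return _ILLEGAL_RE.sub(callback, text)


print(sanitize_with_handler('a\x00b', 'ignore'))
print(sanitize_with_handler('a\x00b', 'replace'))
print(sanitize_with_handler('a\x00b', 'backslashreplace'))
```

With `'strict'`, the handler re-raises the `UnicodeEncodeError`, so invalid input is rejected rather than rewritten.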
