[draft] regex module #5

@clay-arras

TODO

Use the built-in C++ regex library (std::regex). The regex pattern will probably just be borrowed from the tokenizer of a good open-source model.
Example: for DeepSeek, https://huggingface.co/deepseek-ai/DeepSeek-R1/raw/main/tokenizer.json
Here is the pattern, I believe (as escaped in the JSON): "[!\"#$%&'()*+,\\-./:;<=>?@\\[\\\\\\]^_`{|}~][A-Za-z]+|[^\r\n\\p{L}\\p{P}\\p{S}]?[\\p{L}\\p{M}]+| ?[\\p{P}\\p{S}]+[\r\n]*|\\s*[\r\n]+|\\s+(?!\\S)|\\s+"
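
A rough sketch of how the pattern could be applied with std::regex. One caveat: the default ECMAScript grammar of std::regex has no \p{...} Unicode property classes, so the pattern cannot be used verbatim. The version below is an ASCII-only approximation (\p{L} becomes A-Za-z, \p{P}\p{S} becomes [:punct:], \p{M} is dropped), which is an assumption made for illustration and not the final pattern. The names splitPattern and preTokenize are hypothetical. Text that no alternative matches (for example bare digits) is kept as its own piece so nothing is lost; that choice is also an assumption.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// ASCII-only stand-in for the tokenizer pattern (assumption, see above).
// Alternatives are tried in order, mirroring the original pattern.
inline const std::regex& splitPattern() {
    static const std::regex pattern(
        R"([[:punct:]][A-Za-z]+)"               // punctuation followed by letters
        R"(|[^\r\nA-Za-z[:punct:]]?[A-Za-z]+)"  // optional leading char + letters
        R"(| ?[[:punct:]]+[\r\n]*)"             // punctuation run, trailing newlines
        R"(|\s*[\r\n]+)"                        // whitespace ending in newlines
        R"(|\s+(?!\S))"                         // whitespace not followed by non-space
        R"(|\s+)");                             // any remaining whitespace
    return pattern;
}

// Splits text into pieces; unmatched gaps are emitted as separate pieces.
std::vector<std::string> preTokenize(const std::string& text) {
    std::vector<std::string> pieces;
    std::size_t lastEnd = 0;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), splitPattern());
         it != end; ++it) {
        const auto pos = static_cast<std::size_t>(it->position());
        if (pos > lastEnd) {
            pieces.push_back(text.substr(lastEnd, pos - lastEnd));
        }
        pieces.push_back(it->str());
        lastEnd = pos + static_cast<std::size_t>(it->length());
    }
    if (lastEnd < text.size()) {
        pieces.push_back(text.substr(lastEnd));
    }
    return pieces;
}
```

For example, preTokenize("Hello, world!") yields the pieces "Hello", ",", " world", "!". Full Unicode support would need a different approach to the \p{...} classes, which this sketch does not attempt.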

Chunking:

  • Split the text into chunks of 1024 characters.
  • Process the chunks iteratively; each chunk is split according to the regex pattern.
  • Carry over the last split piece of the previous iteration into the next chunk, capped at 512 characters: min(lastSplitChunk, 512).
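
The chunking steps above could be sketched roughly as follows. This is one reading of the plan, not a settled design: it assumes "characters" means bytes of a std::string (so a multi-byte UTF-8 sequence may be cut at a chunk boundary), and that when the carried piece exceeds the cap, its front part is emitted and only the last 512 bytes are carried. The names splitInChunks, Splitter, and splitRuns are hypothetical; splitRuns is a trivial stand-in for the regex splitter so the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Splitter = std::function<std::vector<std::string>(const std::string&)>;

// Stand-in splitter for the demo: alternating runs of spaces and non-spaces.
std::vector<std::string> splitRuns(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size();) {
        const bool space = (s[i] == ' ');
        std::size_t j = i;
        while (j < s.size() && (s[j] == ' ') == space) ++j;
        out.push_back(s.substr(i, j - i));
        i = j;
    }
    return out;
}

// Splits text chunk by chunk. The last piece of every non-final chunk may
// have been cut by the chunk boundary, so it is held back and prepended to
// the next chunk instead of being emitted.
std::vector<std::string> splitInChunks(const std::string& text,
                                       const Splitter& split,
                                       std::size_t chunkSize = 1024,
                                       std::size_t maxCarry = 512) {
    assert(chunkSize > 0);
    std::vector<std::string> out;
    std::string carry;
    for (std::size_t pos = 0; pos < text.size(); pos += chunkSize) {
        const bool isLast = pos + chunkSize >= text.size();
        std::vector<std::string> pieces =
            split(carry + text.substr(pos, chunkSize));
        carry.clear();
        if (!isLast && !pieces.empty()) {
            carry = std::move(pieces.back());
            pieces.pop_back();
        }
        out.insert(out.end(), pieces.begin(), pieces.end());
        // Cap the carry: emit the front, keep only the last maxCarry bytes.
        if (carry.size() > maxCarry) {
            out.push_back(carry.substr(0, carry.size() - maxCarry));
            carry.erase(0, carry.size() - maxCarry);
        }
    }
    return out;
}
```

With a small chunk size, splitInChunks("aaaa bbbb cccc", splitRuns, 6, 4) gives the same pieces as splitRuns on the whole string. Carrying only the last piece may not reproduce the unchunked result for every pattern (lookaheads such as \s+(?!\S) depend on text beyond the piece), so comparing chunked and unchunked output on real text would be a sensible check.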
