Skip to content

Implement the Tokenizer Metrics Script #21

Description

@Malikeh97
  • Apply DataTrove Data Segmenters for Language-Specific Word Tokenization

  • Create a list of candidate datasets for fertility, parity, and PCW analysis

  • Add stand alone functions for metric evaluation and visualization for the paper

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions