0th is tokenized instead of 4th, 5th, 6th, etc. #7

@tbrodbeck

Here is an example of 0th appearing instead of 5th (2nd line of tifu_all_tokenized_and_filtered.json):

"selftext_html": "[...] Confuse a 5th grade girl for a boy in front of half of her class. Kids are mean. Sorry Sandra.</strong></p>\n</div><!-- SC_ON -->",
"tldr_tokenized": [
    "confuse",
    "a",
    "0th",
    "grade",
    "girl",
    "for",
    "a",
    "boy",
    "in",
    "front",
    "of",
    "half",
    "of",
    "her",
    "class",
    "kids",
    "are",
    "mean",
    "sorry",
    "sandra",
    "*"
  ],

Is this an error, or is it intended for some reason?
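
To gauge how widespread this is, a quick scan along the following lines might help. It is only a sketch: it assumes the file holds one JSON object per line (as "2nd line" suggests) and that `tldr_tokenized` is a list of strings or null. The function name `count_zero_ordinals` is made up for illustration.

```python
import json
import re
from collections import Counter

# Ordinal-looking tokens whose numeric part is all zeros, e.g. "0th" or "00th".
ZERO_ORDINAL = re.compile(r"0+(st|nd|rd|th)")

def count_zero_ordinals(lines, field="tldr_tokenized"):
    """Count zero-ordinal tokens across JSON-lines records (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: the field may be missing or null for some posts.
        for token in record.get(field) or []:
            if ZERO_ORDINAL.fullmatch(token):
                counts[token] += 1
    return counts

if __name__ == "__main__":
    sample = ['{"tldr_tokenized": ["confuse", "a", "0th", "grade", "girl"]}']
    print(count_zero_ordinals(sample))
```

Passing an open file handle for tifu_all_tokenized_and_filtered.json instead of `sample` should work the same way, since the function only iterates over lines.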

PS: I also just realized that the * token is erroneous as well, isn't it? It is probably caused by the bold text in the original post (see https://www.reddit.com/r/tifu/comments/1ggydk/tifu_by_genderstereotyping/).
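
For illustration, here is one way both symptoms could arise. This is purely a guess and not taken from the repository's preprocessing code: it assumes a step that replaces every digit with 0 and a tokenizer that keeps a leftover markdown bold marker as a `*` token. The function `hypothetical_tokenize` and the shortened input string are made up for this sketch.

```python
import re

def hypothetical_tokenize(text):
    """Illustrative only: NOT the repository's actual preprocessing."""
    text = text.lower()
    # Assumed step 1: collapse every digit to "0", which turns "5th" into "0th".
    text = re.sub(r"\d", "0", text)
    # Assumed step 2: reduce runs of "*" (markdown bold markers) to a single "*".
    text = re.sub(r"\*+", " * ", text)
    # Assumed step 3: keep alphanumeric words and "*" as tokens, drop other punctuation.
    return re.findall(r"[a-z0-9]+|\*", text)

if __name__ == "__main__":
    # Shortened input with a trailing bold marker, as it might look after TL;DR extraction.
    print(hypothetical_tokenize("Confuse a 5th grade girl for a boy. Sorry Sandra.**"))
```

If the real pipeline does something similar, the fix would presumably be to strip markdown emphasis before tokenizing and to leave ordinals out of any digit normalization.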
