6 changes: 3 additions & 3 deletions xtas/tasks/_nl_conll_ner.py
@@ -31,12 +31,12 @@ def _download_training_data():

Returns an iterable over the lines of the concatenated dataset.
"""
- return (ln for part in ["train", "testa", "testb"]
+ return (ln for part in ["train", "testa", "testb"] #

Member Author

This gets downloaded for every session now. Maybe it would be better if we made a local copy?

Collaborator

That's legally tricky.

Member Author

In that case, so is downloading it in the first place, unless the data has an unusual license. I'm planning to do a legal analysis of xtas anyway, so let's leave this for now and figure it out as part of that.
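
For whenever the licensing question is settled, here is a minimal sketch of what "a local copy" could look like mechanically. It says nothing about whether caching is permitted. The names `cached_lines`, `cache_path` and `fetch` are hypothetical and not part of xtas, and the sketch assumes Python 3's `urllib.request.urlopen`, which may differ from the import the module actually uses.

```python
import os
import tempfile
from urllib.request import urlopen


def cached_lines(url, cache_path, fetch=urlopen):
    """Yield the lines found at `url`, downloading to `cache_path` only once.

    Hypothetical helper for illustration; `fetch` is injectable so the
    function can be exercised without network access.
    """
    if not os.path.exists(cache_path):
        # Download into a temporary file in the same directory, then rename,
        # so an interrupted download never leaves a truncated cache file.
        cache_dir = os.path.dirname(cache_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
        try:
            with os.fdopen(fd, "wb") as tmp:
                for ln in fetch(url):
                    tmp.write(ln)
            os.replace(tmp_path, cache_path)
        except BaseException:
            os.unlink(tmp_path)
            raise
    # Serve every request, including the first, from the cached file.
    with open(cache_path, "rb") as f:
        for ln in f:
            yield ln
```

Because this is a generator, the download happens on first iteration rather than at call time, which matches the lazy style of the existing generator expression.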

for ln in urlopen(_BASE_URL + part))



- def _features(sentence, i):
+ def _features(sentence, i): #

Member Author

If seqlearn let you return a list rather than an iterable, this function would look a bit nicer IMHO. It's only ever going to be a small list of features I think, so I can't imagine performance being an issue?
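
To make the comparison concrete, here is a rough sketch of the same feature function written both ways. The feature names (`word:`, `capitalized`, `first`) and the function names are invented for illustration and are not the features `_features` actually computes. Only the `sentence[i].split()[0]` line is taken from the diff.

```python
def features_iter(sentence, i):
    """Generator style: yield one feature string at a time."""
    word = sentence[i].split()[0]
    yield "word:" + word.lower()
    if word[:1].isupper():
        yield "capitalized"
    if i == 0:
        yield "first"


def features_list(sentence, i):
    """List style: build the same features and return them together."""
    word = sentence[i].split()[0]
    feats = ["word:" + word.lower()]
    if word[:1].isupper():
        feats.append("capitalized")
    if i == 0:
        feats.append("first")
    return feats


# Both styles produce the same features for a token.
sentence = ["Jan N", "woont V", "in Prep", "Amsterdam N"]
assert list(features_iter(sentence, 0)) == features_list(sentence, 0)
```

A list is itself an iterable, so any consumer that only iterates over the result should accept either form. Whether seqlearn does anything beyond iterating is not something this sketch verifies.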

"""Baseline named-entity recognition features for i'th token in sentence.
"""
word = sentence[i].split()[0]
@@ -70,7 +70,7 @@ def _train_ner_model():

def ner(tokens):
"""Baseline NER tagger for Dutch, based on the CoNLL'02 dataset."""
-
+ #

Member Author

Missing documentation of input and output.
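
As a starting point, the docstring could follow a layout like the one below. The input and output types of `ner` are not visible in this diff, so both the "list of str" input and the "(token, label) pairs" output are assumptions to be corrected against the real function. `ner_stub` is a hypothetical stand-in whose body is a placeholder, not the model-based tagger.

```python
def ner_stub(tokens):
    """Baseline NER tagger for Dutch, based on the CoNLL'02 dataset.

    Parameters
    ----------
    tokens : list of str
        The tokens of a single sentence, in order. (Assumed type.)

    Returns
    -------
    tagged : list of (str, str) pairs
        One (token, label) pair per input token. (Assumed format.)
    """
    # Placeholder: label every token "O" instead of consulting a model.
    return [(tok, "O") for tok in tokens]
```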

global _model

X = [_features(tokens, i) for i in range(len(tokens))]