Notes on _nl_conll_ner.py #106
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -31,12 +31,12 @@ def _download_training_data(): | |
| Returns an iterable over the lines of the concatenated dataset. | ||
| """ | ||
| - return (ln for part in ["train", "testa", "testb"] | ||
| + return (ln for part in ["train", "testa", "testb"] # | ||
| for ln in urlopen(_BASE_URL + part)) | ||
| - def _features(sentence, i): | ||
| + def _features(sentence, i): # | ||
If seqlearn let you return a list rather than an iterable, this function would look a bit nicer IMHO. It's only ever going to be a small list of features, I think, so I can't imagine performance being an issue?
| """Baseline named-entity recognition features for i'th token in sentence. | ||
| """ | ||
| word = sentence[i].split()[0] | ||
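The list-versus-iterable question raised in the review comment on `_features` can be illustrated with a minimal sketch. The two functions below are hypothetical and are not taken from xtas or seqlearn; the feature strings (`word:`, `capitalized`, `first`) are invented for illustration. Only the `sentence[i].split()[0]` line mirrors the diff.

```python
def features_gen(sentence, i):
    """Yield string features for the i'th token (generator style)."""
    # Same token extraction as in the diff: keep the first
    # whitespace-separated field of the i'th entry.
    word = sentence[i].split()[0]
    yield "word:" + word.lower()
    if word[0].isupper():
        yield "capitalized"
    if i == 0:
        yield "first"


def features_list(sentence, i):
    """Return the same (illustrative) features as a plain list."""
    word = sentence[i].split()[0]
    feats = ["word:" + word.lower()]
    if word[0].isupper():
        feats.append("capitalized")
    if i == 0:
        feats.append("first")
    return feats
```

Since a Python list is itself an iterable, any caller that only iterates over the result accepts either form. Whether seqlearn imposes a stricter requirement is not established by this thread.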
@@ -70,7 +70,7 @@ def _train_ner_model(): | |
| def ner(tokens): | ||
| """Baseline NER tagger for Dutch, based on the CoNLL'02 dataset.""" | ||
| # | ||
Missing documentation of input and output.
| global _model | ||
| X = [_features(tokens, i) for i in range(len(tokens))] | ||
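As a rough illustration of what the "missing documentation of input and output" comment asks for, a docstring might look like the sketch below. The diff does not show what `ner` returns, so the return type here (a list of `(token, tag)` pairs) is an assumption, the function name `ner_documented` is hypothetical, and the body is a placeholder rather than the real tagger.

```python
def ner_documented(tokens):
    """Baseline NER tagger for Dutch, based on the CoNLL'02 dataset.

    Parameters
    ----------
    tokens : list of str
        The tokens of a single sentence, in order.

    Returns
    -------
    list of (str, str)
        ASSUMPTION: one (token, tag) pair per input token. The actual
        return type is not visible in the diff under review.
    """
    # Placeholder body so the sketch runs: tags every token as "O"
    # (outside any entity). The real function would consult the model.
    return [(tok, "O") for tok in tokens]
```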
This gets downloaded for every session now. Maybe it would be better if we made a local copy?
That's legally tricky.
In that case, so is downloading it in the first place, unless they have an unusual license. I'm planning to do a legal analysis of xtas anyway, so let's leave this for now and we'll figure it out as part of that.
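Should the licensing question be settled in favour of keeping a local copy, the "download once per machine rather than once per session" idea could look roughly like the sketch below. It is not part of xtas: the function name `cached_download`, the `cache_path` argument, and the injectable `fetch` parameter are all assumptions made for illustration, and it uses Python 3's `urllib.request`.

```python
import os
from urllib.request import urlopen


def cached_download(url, cache_path, fetch=None):
    """Return the bytes at `url`, downloading only if `cache_path` is missing."""
    # Cache hit: read the local copy and skip the network entirely.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()

    if fetch is None:
        # Default fetcher: a real HTTP request via the standard library.
        def fetch(u):
            with urlopen(u) as resp:
                return resp.read()

    data = fetch(url)

    # Write to a temporary name first, then rename, so an interrupted
    # write does not leave a truncated file at the cache location.
    tmp_path = cache_path + ".part"
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.replace(tmp_path, cache_path)
    return data
```

The `fetch` hook exists only so the caching logic can be exercised without network access; a caller would normally omit it.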