Theory Question: Collocater vs. nltk Collocation #3

@ChasNelson1990

Hi there,

First up, apologies if this is a stupid question - I'm not an NLP person and some of the language and ideas are brand new to me.

So, as I understand it, collocation is the idea of commonly occurring sequences of words. Prior to actually looking into NLP this week, I would have called these n-grams, and I think NLTK agrees with me. NLTK's collocation functions primarily look for n-grams, do some filtering, and return the results (see https://github.com/nltk/nltk/blob/develop/nltk/collocations.py).

So, if I run nltk's collocation on (as an example) The Hound of the Baskervilles, I get phrases like 'Mr. Holmes', 'Grimpen Mire', 'escaped convict' and 'missing boot' - these all seem pretty reasonable given the plot.
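To make the comparison concrete, here is a minimal standard-library sketch of the statistical approach as described above: count adjacent word pairs, filter out rare ones, and rank the rest by an association score. This is not NLTK's code; the function name `pmi_bigrams`, the `min_freq` and `top_n` parameters, the choice of pointwise mutual information as the score, and the toy sentence are all assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_freq=2, top_n=5):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    scored = []
    for (w1, w2), pair_count in bigrams.items():
        # Frequency filter: ignore pairs seen fewer than min_freq times.
        if pair_count < min_freq:
            continue
        # PMI compares how often the pair occurs with how often it would
        # occur if the two words were independent.
        pmi = math.log2(pair_count * total / (unigrams[w1] * unigrams[w2]))
        scored.append(((w1, w2), pmi))

    # Highest score first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in scored[:top_n]]


# Toy text (made up for this example, not the actual novel).
tokens = (
    "the hound ran over the grimpen mire and the escaped convict "
    "hid near the grimpen mire while the escaped convict slept"
).split()

print(pmi_bigrams(tokens, top_n=2))
```

Because the candidates come straight from the text, a pair such as 'grimpen mire' can surface without any external word list, which would explain why proper nouns show up in this kind of output.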

But if I run your collocater pipeline I get very different results (and it takes significantly longer to process). The key differences I can see are: no proper nouns; duplicate entries; and duplicates that don't compare as equal, e.g. several 'different' 'look at's are returned.

So, I think the lack of proper nouns is caused by the fact that you're determining collocations from a collocation dictionary, so words like 'Sherlock' will never be processed.
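A rough sketch of that guess, assuming the pipeline only reports pairs that appear in a fixed lookup table. The table `COLLOCATION_DICT`, the function `dictionary_collocations`, and the example sentence are hypothetical and are not taken from the collocater source.

```python
# Hypothetical toy dictionary: headword -> known collocates.
COLLOCATION_DICT = {
    "look": {"at", "for"},
    "escaped": {"convict"},
}


def dictionary_collocations(tokens):
    """Return (position, phrase) for adjacent pairs listed in the dictionary."""
    found = []
    for i in range(len(tokens) - 1):
        head, following = tokens[i].lower(), tokens[i + 1].lower()
        # A pair is reported only if the dictionary already knows it.
        if following in COLLOCATION_DICT.get(head, ()):
            found.append((i, f"{head} {following}"))
    return found


tokens = "Sherlock Holmes did look at the boot then look at the moor".split()
hits = dictionary_collocations(tokens)
print(hits)
```

Under this assumption 'Sherlock Holmes' can never be returned, however often it occurs, because neither word is a dictionary entry.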

I think the number of duplicate entries roughly corresponds to the number of times that collocation occurs, and the fact that the duplicates aren't equal is presumably down to the spaCy vectors on those tokens being non-equal.
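One way duplicates like this could be collapsed, sketched under the assumption that each returned match is tied to a position in the text (a guess, not confirmed behaviour of the pipeline). The `Match` class and the sample matches are hypothetical stand-ins for whatever objects the pipeline actually returns.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical stand-in for one collocation occurrence."""
    start: int  # token offset where the match begins
    text: str   # surface text of the match


matches = [
    Match(3, "look at"),
    Match(8, "look at"),
    Match(20, "escaped convict"),
]

# Same text, different positions: the two objects do not compare equal.
same_object = matches[0] == matches[1]

# Grouping by normalised text gives one entry per collocation plus a count.
counts = Counter(m.text.lower() for m in matches)
print(same_object, counts.most_common())
```

Keying on the text rather than on the match object keeps the occurrence count while removing the apparent duplicates.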

So, my first question is: what are you actually doing to determine these collocations? Why do you need to refer to a dictionary source in order to extract these?

I have a series of follow-up questions that are more about implementation than linguistics, but I think I need to understand the linguistic rationale before I start suggesting technical changes.

Hope you don't mind me reaching out like this.
