Supervised models¶

Feature-based models¶

SupervisedLoadFile¶

class pke.supervised.SupervisedLoadFile¶

Base class that provides additional shared functions for supervised models.

candidate_weighting()¶: Extract features and classify candidates with default parameters.

classify_candidates(model=None)¶

Classify the candidates as keyphrase or not keyphrase.

Parameters: model (str) – path to the model to load, in pickle format; defaults to None.

feature_extraction()¶: Skeleton for feature extraction.

feature_scaling()¶: Scale features to [0,1].
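As an illustration of scaling features to [0,1], here is a minimal min-max scaling sketch in plain Python. It is not the library's implementation; the helper name `min_max_scale` and the sample feature vectors are hypothetical.

```python
def min_max_scale(rows):
    """Rescale each feature column of `rows` to the range [0, 1]."""
    if not rows:
        return []
    # Transpose so that each tuple holds all values of one feature.
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column carries no information; map it to 0.0.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical feature vectors, one per candidate.
features = [[2.0, 0.10], [4.0, 0.50], [6.0, 0.90]]
scaled = min_max_scale(features)
```

After scaling, the smallest value of each feature maps to 0 and the largest to 1, so features with different ranges become comparable.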

instances¶: The instances container.

Kea¶

class pke.supervised.Kea¶

Kea keyphrase extraction model.

Parameterized example:

import pke

# 1. create a Kea extractor.
extractor = pke.supervised.Kea()

# 2. load the content of the document.
stoplist = pke.lang.stopwords.get('en')
extractor.load_document(input='path/to/input',
                        language='en',
                        stoplist=stoplist,
                        normalization=None)

# 3. select 1-3 grams that do not start or end with a stopword as
#    candidates. Candidates that contain punctuation marks as words
#    are discarded.
extractor.candidate_selection()

# 4. classify candidates as keyphrase or not keyphrase.
df = pke.load_document_frequency_file(input_file='path/to/df.tsv.gz')
model_file = 'path/to/kea_model'
extractor.candidate_weighting(model_file=model_file,
                              df=df)

# 5. get the 10 highest-scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

candidate_selection()¶: Select 1-3 grams of normalized words as keyphrase candidates. Candidates that start or end with a stopword are discarded. Candidates that contain punctuation marks (from string.punctuation) as words are filtered out.
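The selection rules described above can be sketched in plain Python as follows. This is a simplified illustration, not pke's code: it works on a single pre-tokenized sentence, uses lowercasing as a stand-in for word normalization, and the function name `select_ngram_candidates` is hypothetical.

```python
import string


def select_ngram_candidates(tokens, stoplist, max_n=3):
    """Return 1..max_n grams that satisfy the Kea-style filters."""
    stop = {word.lower() for word in stoplist}
    punctuation = set(string.punctuation)
    candidates = []
    for start in range(len(tokens)):
        for n in range(1, max_n + 1):
            gram = tokens[start:start + n]
            if len(gram) < n:
                break  # ran past the end of the sentence
            if any(tok in punctuation for tok in gram):
                break  # longer grams from this start contain it too
            if gram[0].lower() in stop or gram[-1].lower() in stop:
                continue  # starts or ends with a stopword
            candidates.append(" ".join(gram).lower())
    return candidates


tokens = ["Keyphrase", "extraction", "is", "useful", "."]
candidates = select_ngram_candidates(tokens, stoplist=["is"])
```

Note that a stopword may still appear inside a candidate ("extraction is useful"); only the first and last words are checked.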

candidate_weighting(model_file=None, df=None)¶

Extract features and classify candidates.

Parameters

model_file (str) – path to the model file.
df (dict) – document frequencies; the number of documents should be specified using the "--NB_DOC--" key.

feature_extraction(df=None, training=False)¶

Extract features for each keyphrase candidate. Features are the tf*idf of the candidate and its first occurrence relative to the document.

Parameters

df (dict) – document frequencies; the number of documents should be specified using the "--NB_DOC--" key.
training (bool) – indicates whether features are computed for the training set when computing IDF weights; defaults to False.
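To make the two features concrete, here is a minimal sketch of how tf*idf and relative first occurrence could be computed for one candidate. The base-2 logarithm and the add-one adjustment (counting the current document) are assumptions made for illustration, and the helper `kea_features` with its parameters is hypothetical rather than part of pke.

```python
import math


def kea_features(term_frequency, first_offset, doc_length, candidate, df):
    """Return [tf*idf, first occurrence] for a single candidate."""
    # Assumption: the current document is counted on top of the collection.
    n_docs = df.get("--NB_DOC--", 0) + 1
    candidate_df = df.get(candidate, 0) + 1
    # Assumption: base-2 logarithm for the inverse document frequency.
    idf = math.log2(n_docs / candidate_df)
    tf_idf = term_frequency * idf
    # Position of the first occurrence, relative to the document length.
    first_occurrence = first_offset / doc_length
    return [tf_idf, first_occurrence]


# Hypothetical document frequencies: 7 documents, 1 containing "keyphrase".
df = {"--NB_DOC--": 7, "keyphrase": 1}
features = kea_features(term_frequency=3, first_offset=10, doc_length=200,
                        candidate="keyphrase", df=df)
```

A candidate that is frequent in the document but rare in the collection gets a high tf*idf, and a candidate that appears early gets a first-occurrence value close to 0.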

static train(training_instances, training_classes, model_file)¶

Train a Naive Bayes classifier and store the model in a file.

Parameters

training_instances (list) – list of feature vectors, one per candidate.
training_classes (list) – list of binary labels (keyphrase or not keyphrase), one per instance.
model_file (str) – the model output file.
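The train-then-store workflow can be illustrated with a small, self-contained sketch. It uses a hand-written Gaussian Naive Bayes, chosen only to keep the example dependency-free; the variant and storage details used by pke may differ. All helper names, the toy data, and the file name below are hypothetical.

```python
import math
import os
import pickle
import tempfile


def train_gaussian_nb(instances, classes):
    """Fit per-class priors, feature means, and feature variances."""
    model = {}
    for label in set(classes):
        rows = [x for x, y in zip(instances, classes) if y == label]
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (len(rows) / len(instances), means, variances)
    return model


def predict(model, x):
    """Return the label with the highest log-posterior for `x`."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for value, mean, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


def train(training_instances, training_classes, model_file):
    """Train the classifier and store it in `model_file` with pickle."""
    model = train_gaussian_nb(training_instances, training_classes)
    with open(model_file, "wb") as handle:
        pickle.dump(model, handle)


# Toy data: [tf*idf, first occurrence] per candidate, 1 = keyphrase.
training_instances = [[6.0, 0.05], [5.0, 0.10], [0.5, 0.80], [1.0, 0.90]]
training_classes = [1, 1, 0, 0]
model_path = os.path.join(tempfile.mkdtemp(), "toy_model.pickle")
train(training_instances, training_classes, model_path)

# Reload the stored model and classify a new candidate.
with open(model_path, "rb") as handle:
    loaded = pickle.load(handle)
prediction = predict(loaded, [5.5, 0.07])
```

Storing the fitted model separately from training lets the classification step load it later by path, which is the role the model_file argument plays in candidate_weighting.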

WINGNUS¶

class pke.supervised.WINGNUS¶

WINGNUS keyphrase extraction model.

Parameterized example:

import pke

# 1. create a WINGNUS extractor.
extractor = pke.supervised.WINGNUS()

# 2. load the content of the document.
extractor.load_document(input='path/to/input.xml')

# 3. select simplex noun phrases as candidates.
extractor.candidate_selection()

# 4. classify candidates as keyphrase or not keyphrase.
df = pke.load_document_frequency_file(input_file='path/to/df.tsv.gz')
model_file = 'path/to/wingnus_model'
extractor.candidate_weighting(model_file=model_file, df=df)

# 5. get the 10 highest-scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

candidate_selection(grammar=None)¶

Select noun phrases (NP) and NPs containing a prepositional phrase (NP IN NP) as keyphrase candidates.

Parameters: grammar (str) – grammar defining POS patterns of NPs.
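To show what matching NP and NP IN NP patterns over part-of-speech tags can look like, here is a rough sketch that encodes tags as single characters and scans them with a regular expression. It is not pke's grammar or chunker: the tag set, the pattern, and the function name `select_noun_phrases` are assumptions made for illustration.

```python
import re


def select_noun_phrases(tagged):
    """Return NP and NP-preposition-NP spans from (word, pos) pairs."""
    # Encode each tag as one character so a regex can scan the sequence.
    code = {"ADJ": "A", "NOUN": "N", "PROPN": "N", "ADP": "P"}
    tags = "".join(code.get(pos, "x") for _, pos in tagged)
    # NP: optional modifiers ending in a noun; optionally followed by a
    # preposition and a second NP.
    pattern = re.compile(r"[AN]*N(?:P[AN]*N)?")
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in pattern.finditer(tags)
    ]


# Hypothetical tagged sentence using universal POS tags.
tagged = [("automatic", "ADJ"), ("extraction", "NOUN"), ("of", "ADP"),
          ("keyphrases", "NOUN"), ("works", "VERB")]
phrases = select_noun_phrases(tagged)
```

This sketch returns only the longest match at each position; a chunk grammar passed through the grammar parameter can express richer patterns.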

candidate_weighting(model_file=None, df=None)¶

Extract features and classify candidates.

Parameters

model_file (str) – path to the model file.
df (dict) – document frequencies; the number of documents should be specified using the "--NB_DOC--" key.

feature_extraction(df=None, training=False, features_set=None)¶

Extract features for each candidate.

Parameters

df (dict) – document frequencies; the number of documents should be specified using the "--NB_DOC--" key.
training (bool) – indicates whether features are computed for the training set when computing IDF weights; defaults to False.
features_set (list) – the set of features to use, defaults to [1, 4, 6].

static train(training_instances, training_classes, model_file)¶

Train a Naive Bayes classifier and store the model in a file.

Parameters

training_instances (list) – list of feature vectors, one per candidate.
training_classes (list) – list of binary labels (keyphrase or not keyphrase), one per instance.
model_file (str) – the model output file.
