Supervised models

Feature-based models

SupervisedLoadFile

class pke.supervised.SupervisedLoadFile

The SupervisedLoadFile class provides extra base functions for supervised models.

candidate_weighting()

Extract features and classify candidates with default parameters.

classify_candidates(model=None)

Classify the candidates as keyphrase or not keyphrase.

Parameters

model (str) – path to the model file to load, in pickle format; defaults to None.

feature_extraction()

Skeleton for feature extraction.

feature_scaling()

Scale features to [0,1].
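The scaling step can be pictured with a minimal sketch, assuming plain min-max scaling applied per feature column (the source does not state which scaler is used). The function name `min_max_scale` is hypothetical and not part of pke.

```python
def min_max_scale(rows):
    """Scale each feature column of `rows` to the range [0, 1].

    Hypothetical helper for illustration only; not part of pke.
    """
    if not rows:
        return []
    # Transpose the rows to get one tuple per feature column.
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread, so map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled


# Three candidates with two features each.
features = [[2.0, 10.0], [4.0, 20.0], [6.0, 40.0]]
scaled = min_max_scale(features)
```

After scaling, the smallest value in each column maps to 0 and the largest to 1, so features with different ranges become comparable.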

instances

The instances container.

Kea

class pke.supervised.Kea

Kea keyphrase extraction model.

Parameterized example:

import pke

# 1. create a Kea extractor.
extractor = pke.supervised.Kea()

# 2. load the content of the document.
stoplist = pke.lang.stopwords.get('en')
extractor.load_document(input='path/to/input',
                        language='en',
                        stoplist=stoplist,
                        normalization=None)

# 3. select 1-3 grams that do not start or end with a stopword as
#    candidates. Candidates that contain punctuation marks as words
#    are discarded.
extractor.candidate_selection()

# 4. classify candidates as keyphrase or not keyphrase.
df = pke.load_document_frequency_file(input_file='path/to/df.tsv.gz')
model_file = 'path/to/kea_model'
extractor.candidate_weighting(model_file=model_file,
                              df=df)

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

candidate_selection()

Select 1-3 grams of normalized words as keyphrase candidates. Candidates that start or end with a stopword are discarded. Candidates that contain punctuation marks (from string.punctuation) as words are filtered out.
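The selection rules described above can be sketched in plain Python. This is an illustration under simplifying assumptions, not the pke implementation: it works on one flat list of already lowercased tokens, skips normalization (stemming), and the function name `select_ngram_candidates` is hypothetical.

```python
import string


def select_ngram_candidates(words, stoplist, max_n=3):
    """Return 1..max_n grams that pass the filters described above.

    Hypothetical helper for illustration only; not part of pke.
    """
    punctuation = set(string.punctuation)
    stop = set(stoplist)
    candidates = []
    for start in range(len(words)):
        for n in range(1, max_n + 1):
            gram = words[start:start + n]
            if len(gram) < n:
                break  # ran past the end of the token list
            # Discard n-grams that start or end with a stopword.
            if gram[0] in stop or gram[-1] in stop:
                continue
            # Discard n-grams that contain a punctuation mark as a word.
            if any(word in punctuation for word in gram):
                continue
            candidates.append(" ".join(gram))
    return candidates


words = "extraction of keyphrases , fast".split()
candidates = select_ngram_candidates(words, stoplist={"of"})
# candidates == ["extraction", "extraction of keyphrases", "keyphrases", "fast"]
```

Note that a stopword may still appear inside a candidate ("extraction of keyphrases"); only the first and last words are checked.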

candidate_weighting(model_file=None, df=None)

Extract features and classify candidates.

Parameters
  • model_file (str) – path to the model file.

  • df (dict) – document frequencies; the number of documents should be specified using the “--NB_DOC--” key.
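As a rough picture of the expected df structure, the sketch below builds a document-frequency dictionary from a toy corpus. It assumes the document count is stored under the literal key `--NB_DOC--` (two hyphens on each side) and that keys are plain tokens; in practice the keys would be normalized candidate phrases. The function name `build_document_frequencies` is hypothetical.

```python
from collections import Counter


def build_document_frequencies(documents):
    """Count in how many documents each term occurs.

    Hypothetical helper for illustration only; not part of pke.
    """
    counts = Counter()
    for tokens in documents:
        # Use a set so a term counts once per document.
        counts.update(set(tokens))
    df = dict(counts)
    # Assumed reserved key holding the number of documents.
    df["--NB_DOC--"] = len(documents)
    return df


corpus = [
    ["graph", "ranking", "graph"],
    ["graph", "model"],
    ["naive", "bayes", "model"],
]
df = build_document_frequencies(corpus)
```

A dictionary of this shape is what the df parameter is expected to receive, typically loaded with pke.load_document_frequency_file as in the example above.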

feature_extraction(df=None, training=False)

Extract features for each keyphrase candidate. Features are the tf*idf of the candidate and its first occurrence relative to the document.

Parameters
  • df (dict) – document frequencies; the number of documents should be specified using the “--NB_DOC--” key.

  • training (bool) – indicates whether features are computed for the training set (this affects how IDF weights are computed); defaults to False.
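The two features can be illustrated with a small worked sketch. The exact formula is an assumption here: a base-2 logarithm and add-one smoothing of the document frequency are used for the IDF, and the first occurrence is the offset of the first mention divided by the document length. The function name `kea_features` and its parameters are hypothetical.

```python
import math


def kea_features(candidate, occurrences, first_offset, doc_length, df):
    """Return [tf*idf, relative first occurrence] for one candidate.

    Hypothetical helper for illustration only; not part of pke.
    """
    # Number of documents, read from the assumed reserved key.
    n_docs = df.get("--NB_DOC--", 1)
    # Add-one smoothing (an assumption) so unseen candidates do not
    # cause a division by zero.
    candidate_df = 1 + df.get(candidate, 0)
    idf = math.log2(n_docs / candidate_df)
    tf_idf = occurrences * idf
    # Position of the first mention, relative to the document length.
    first_occurrence = first_offset / doc_length
    return [tf_idf, first_occurrence]


df = {"--NB_DOC--": 8, "keyphras extract": 1}
features = kea_features("keyphras extract", occurrences=3,
                        first_offset=10, doc_length=200, df=df)
# idf = log2(8 / 2) = 2, so tf*idf = 3 * 2 = 6; first occurrence = 10 / 200
```

A candidate that appears often in the document, rarely in the collection, and early in the text therefore gets a high tf*idf and a low first-occurrence value.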

static train(training_instances, training_classes, model_file)

Train a Naive Bayes classifier and store the model in a file.

Parameters
  • training_instances (list) – list of features.

  • training_classes (list) – list of binary values.

  • model_file (str) – the model output file.
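The two training lists are parallel: one feature vector and one binary label per candidate. The sketch below shows one way they might be assembled from per-candidate features and a set of gold keyphrases; the function name `build_training_data` and the input layout are assumptions for illustration.

```python
def build_training_data(candidate_features, gold_keyphrases):
    """Build parallel lists of feature vectors and binary labels.

    Hypothetical helper for illustration only; not part of pke.
    """
    training_instances = []
    training_classes = []
    for candidate, features in candidate_features.items():
        training_instances.append(features)
        # 1 if the candidate is a reference keyphrase, 0 otherwise.
        training_classes.append(1 if candidate in gold_keyphrases else 0)
    return training_instances, training_classes


candidate_features = {
    "keyphras extract": [6.0, 0.05],
    "first section": [0.4, 0.10],
    "naiv bay": [3.2, 0.60],
}
gold = {"keyphras extract", "naiv bay"}
instances, classes = build_training_data(candidate_features, gold)
# instances and classes can then be passed to
# train(training_instances, training_classes, model_file).
```

Keeping the two lists in the same order matters, since the label at index i is taken to describe the feature vector at index i.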

WINGNUS

class pke.supervised.WINGNUS

WINGNUS keyphrase extraction model.

Parameterized example:

import pke

# 1. create a WINGNUS extractor.
extractor = pke.supervised.WINGNUS()

# 2. load the content of the document.
extractor.load_document(input='path/to/input.xml')

# 3. select simplex noun phrases as candidates.
extractor.candidate_selection()

# 4. classify candidates as keyphrase or not keyphrase.
df = pke.load_document_frequency_file(input_file='path/to/df.tsv.gz')
model_file = 'path/to/wingnus_model'
extractor.candidate_weighting(model_file=model_file, df=df)

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

candidate_selection(grammar=None)

Select noun phrases (NP) and NPs containing a prepositional phrase (NP IN NP) as keyphrase candidates.

Parameters

grammar (str) – grammar defining POS patterns of NPs.
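To give a feel for what a POS-pattern grammar selects, the sketch below matches the two shapes named above (NP and NP IN NP) with a regular expression over a tag sequence. It is not the grammar format pke expects and makes several assumptions: Universal POS tags (ADJ, NOUN, PROPN, ADP), a simplex NP defined as up to two modifiers followed by a noun, and a hypothetical function name `noun_phrase_candidates`.

```python
import re

# A simplex noun phrase: up to two adjective/noun modifiers, then a noun.
NBAR = r"(?:(?:ADJ|NOUN|PROPN) ){0,2}(?:NOUN|PROPN) "
# Either a simplex NP, or two of them joined by a preposition (ADP).
PATTERN = re.compile(r"\b" + NBAR + r"(?:ADP " + NBAR + r")?")


def noun_phrase_candidates(tagged):
    """Return the word spans whose POS tags match PATTERN.

    `tagged` is a list of (word, tag) pairs. Hypothetical helper for
    illustration only; not part of pke.
    """
    # One tag per token, each followed by a space, so that counting
    # spaces before a match position gives a token index.
    tags = "".join(tag + " " for _, tag in tagged)
    candidates = []
    for match in PATTERN.finditer(tags):
        start = tags[:match.start()].count(" ")
        end = tags[:match.end()].count(" ")
        candidates.append(" ".join(word for word, _ in tagged[start:end]))
    return candidates


tagged = [("efficient", "ADJ"), ("extraction", "NOUN"), ("of", "ADP"),
          ("keyphrases", "NOUN"), ("works", "VERB")]
candidates = noun_phrase_candidates(tagged)
# candidates == ["efficient extraction of keyphrases"]
```

Passing a custom grammar would change which tag sequences count as noun phrases, and therefore which spans become candidates.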

candidate_weighting(model_file=None, df=None)

Extract features and classify candidates.

Parameters
  • model_file (str) – path to the model file.

  • df (dict) – document frequencies; the number of documents should be specified using the “--NB_DOC--” key.

feature_extraction(df=None, training=False, features_set=None)

Extract features for each candidate.

Parameters
  • df (dict) – document frequencies; the number of documents should be specified using the “--NB_DOC--” key.

  • training (bool) – indicates whether features are computed for the training set (this affects how IDF weights are computed); defaults to False.

  • features_set (list) – the set of features to use, defaults to [1, 4, 6].

static train(training_instances, training_classes, model_file)

Train a Naive Bayes classifier and store the model in a file.

Parameters
  • training_instances (list) – list of features.

  • training_classes (list) – list of binary values.

  • model_file (str) – the model output file.