Supervised models¶
Feature-based models¶
SupervisedLoadFile¶
- class pke.supervised.SupervisedLoadFile¶
The SupervisedLoadFile class provides extra base functions for supervised models.
- candidate_weighting()¶
Extract features and classify candidates with default parameters.
- classify_candidates(model=None)¶
Classify the candidates as keyphrase or not keyphrase.
- Parameters
model (str) – the path of the pickled model to load, defaults to None.
- feature_extraction()¶
Skeleton for feature extraction.
- feature_scaling()¶
Scale features to [0,1].
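As a rough illustration of what scaling features to [0, 1] involves, the following is a minimal min-max scaling sketch in plain Python. The function name `min_max_scale` and the handling of constant columns are assumptions made for this example; they are not taken from pke.

```python
def min_max_scale(instances):
    """Rescale each feature column of `instances` to the range [0, 1]."""
    if not instances:
        return []
    columns = list(zip(*instances))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in instances:
        scaled.append([
            # A constant column has no spread; map it to 0.0 (assumption)
            # to avoid dividing by zero.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled


# Three candidates, two features each.
features = [[2.0, 10.0], [4.0, 10.0], [6.0, 30.0]]
scaled = min_max_scale(features)
print(scaled)  # [[0.0, 0.0], [0.5, 0.0], [1.0, 1.0]]
```

Scaling is done per feature column, so features with large raw values (such as tf*idf) do not dominate features that are already small.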
- instances¶
The instances container.
Kea¶
- class pke.supervised.Kea¶
Kea keyphrase extraction model.
Parameterized example:
    import pke

    # 1. create a Kea extractor.
    extractor = pke.supervised.Kea()

    # 2. load the content of the document.
    stoplist = pke.lang.stopwords.get('en')
    extractor.load_document(input='path/to/input',
                            language='en',
                            stoplist=stoplist,
                            normalization=None)

    # 3. select 1-3 grams that do not start or end with a stopword as
    #    candidates. Candidates that contain punctuation marks as words
    #    are discarded.
    extractor.candidate_selection()

    # 4. classify candidates as keyphrase or not keyphrase.
    df = pke.load_document_frequency_file(input_file='path/to/df.tsv.gz')
    model_file = 'path/to/kea_model'
    extractor.candidate_weighting(model_file=model_file, df=df)

    # 5. get the 10-highest scored candidates as keyphrases
    keyphrases = extractor.get_n_best(n=10)
- candidate_selection()¶
Select 1-3 grams of normalized words as keyphrase candidates. Candidates that start or end with a stopword are discarded. Candidates that contain punctuation marks (from string.punctuation) as words are filtered out.
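The filtering rules described here can be sketched in plain Python as follows. This is an illustration only: the helper `select_ngram_candidates` is hypothetical, it works on a single pre-tokenized sentence, and it skips the word normalization that pke applies.

```python
import string


def select_ngram_candidates(tokens, stoplist, max_n=3):
    """Return the 1..max_n-grams of `tokens` that pass Kea-style filters."""
    stopwords = set(stoplist)
    punctuation = set(string.punctuation)
    candidates = []
    for start in range(len(tokens)):
        for n in range(1, max_n + 1):
            gram = tokens[start:start + n]
            if len(gram) < n:
                break  # ran past the end of the token list
            # Discard candidates that start or end with a stopword.
            if gram[0] in stopwords or gram[-1] in stopwords:
                continue
            # Discard candidates that contain a punctuation mark as a word.
            if any(token in punctuation for token in gram):
                continue
            candidates.append(" ".join(gram))
    return candidates


tokens = "extraction of keyphrases , fast".split()
candidates = select_ngram_candidates(tokens, stoplist=["of"])
print(candidates)
# ['extraction', 'extraction of keyphrases', 'keyphrases', 'fast']
```

Note that a stopword inside a candidate is allowed ("extraction of keyphrases"); only the first and last words are checked.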
- candidate_weighting(model_file=None, df=None)¶
Extract features and classify candidates.
- Parameters
model_file (str) – path to the model file.
df (dict) – document frequencies; the number of documents should be specified using the "--NB_DOC--" key.
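To make the expected shape of the document frequency dictionary concrete, here is a small sketch that builds one from a toy corpus. The function `build_document_frequencies` is hypothetical and counts single tokens for simplicity; the entries in an actual pke document frequency file may be keyed differently (for example by normalized n-grams).

```python
from collections import Counter


def build_document_frequencies(documents):
    """Count in how many documents each term occurs."""
    counts = Counter()
    for tokens in documents:
        # set() ensures a term is counted at most once per document.
        counts.update(set(tokens))
    frequencies = dict(counts)
    # The number of documents is stored under a reserved key.
    frequencies["--NB_DOC--"] = len(documents)
    return frequencies


corpus = [
    ["keyphrase", "extraction"],
    ["keyphrase", "ranking"],
    ["graph", "ranking"],
]
df = build_document_frequencies(corpus)
print(df["--NB_DOC--"], df["keyphrase"], df["graph"])  # 3 2 1
```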
- feature_extraction(df=None, training=False)¶
Extract features for each keyphrase candidate. Features are the tf*idf of the candidate and its first occurrence relative to the document.
- Parameters
df (dict) – document frequencies; the number of documents should be specified using the "--NB_DOC--" key.
training (bool) – indicates whether features are computed for the training set when computing IDF weights, defaults to False.
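The two features named above can be sketched as follows. This is one common formulation, not necessarily the exact computation in pke: the add-one smoothing in the IDF term, the use of token offsets, and the helper name `kea_style_features` are all assumptions made for the example.

```python
import math


def kea_style_features(tokens, candidate, df):
    """Return (tf*idf, relative first occurrence) for one candidate."""
    words = candidate.split()
    n = len(words)
    # Token offsets at which the candidate occurs in the document.
    offsets = [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]
    if not offsets:
        raise ValueError(f"{candidate!r} does not occur in the document")
    n_docs = df["--NB_DOC--"]
    # Add-one smoothing (assumed) keeps unseen candidates from dividing by zero.
    idf = math.log2(n_docs / (1 + df.get(candidate, 0)))
    tf_idf = len(offsets) * idf
    # Position of the first occurrence, relative to the document length.
    first_occurrence = offsets[0] / len(tokens)
    return tf_idf, first_occurrence


df = {"--NB_DOC--": 8, "keyphrase": 1, "model": 3}
tokens = "a keyphrase model ranks each keyphrase".split()
tf_idf, first = kea_style_features(tokens, "keyphrase", df)
print(tf_idf)  # 4.0
```

Here "keyphrase" occurs twice and its smoothed IDF is log2(8 / 2) = 2, giving a tf*idf of 4.0; its first occurrence is at token 1 of 6.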
- static train(training_instances, training_classes, model_file)¶
Train a Naive Bayes classifier and store the model in a file.
- Parameters
training_instances (list) – list of features.
training_classes (list) – list of binary values.
model_file (str) – the model output file.
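The train-then-store workflow can be illustrated with a small from-scratch classifier. The sketch below uses a Gaussian Naive Bayes written in plain Python and serialized with `pickle`; the Naive Bayes variant, the model layout, and the names `train_gaussian_nb` and `predict` are assumptions for illustration and do not reproduce pke's implementation.

```python
import math
import os
import pickle
import tempfile
from collections import defaultdict


def train_gaussian_nb(training_instances, training_classes, model_file):
    """Fit per-class feature means/variances and pickle them to model_file."""
    by_class = defaultdict(list)
    for features, label in zip(training_instances, training_classes):
        by_class[label].append(features)

    model = {}
    total = len(training_instances)
    for label, rows in by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # A small floor on the variance avoids division by zero.
        variances = [
            max(sum((v - m) ** 2 for v in col) / len(col), 1e-9)
            for col, m in zip(columns, means)
        ]
        model[label] = {
            "prior": len(rows) / total,
            "means": means,
            "variances": variances,
        }

    with open(model_file, "wb") as handle:
        pickle.dump(model, handle)


def predict(model_file, features):
    """Load the pickled model and return the most probable class."""
    with open(model_file, "rb") as handle:
        model = pickle.load(handle)

    def log_posterior(stats):
        score = math.log(stats["prior"])
        for value, mean, var in zip(features, stats["means"], stats["variances"]):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return score

    return max(model, key=lambda label: log_posterior(model[label]))


# Toy data: [tf*idf, first occurrence]; 1 = keyphrase, 0 = not a keyphrase.
instances = [[4.0, 0.05], [3.5, 0.10], [0.5, 0.80], [0.2, 0.90]]
classes = [1, 1, 0, 0]

model_path = os.path.join(tempfile.mkdtemp(), "toy_model.pickle")
train_gaussian_nb(instances, classes, model_path)
prediction = predict(model_path, [3.8, 0.07])
print(prediction)  # 1
```

As in the documented signature, training takes a list of feature vectors, a parallel list of binary labels, and an output path. Only load pickled models from sources you trust, since unpickling can execute arbitrary code.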
WINGNUS¶
- class pke.supervised.WINGNUS¶
WINGNUS keyphrase extraction model.
Parameterized example:
    import pke

    # 1. create a WINGNUS extractor.
    extractor = pke.supervised.WINGNUS()

    # 2. load the content of the document.
    extractor.load_document(input='path/to/input.xml')

    # 3. select simplex noun phrases as candidates.
    extractor.candidate_selection()

    # 4. classify candidates as keyphrase or not keyphrase.
    df = pke.load_document_frequency_file(input_file='path/to/df.tsv.gz')
    model_file = 'path/to/wingnus_model'
    extractor.candidate_weighting(model_file=model_file, df=df)

    # 5. get the 10-highest scored candidates as keyphrases
    keyphrases = extractor.get_n_best(n=10)
- candidate_selection(grammar=None)¶
Select noun phrases (NP) and NPs containing a prepositional phrase (NP IN NP) as keyphrase candidates.
- Parameters
grammar (str) – grammar defining POS patterns of NPs.
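To show what selecting candidates by POS pattern amounts to, the sketch below matches NP and NP IN NP patterns with a regular expression over one-letter tag codes. The tag names, the pattern itself, and the helper `select_noun_phrases` are hypothetical; this is not pke's grammar syntax, and it returns only the longest non-overlapping matches.

```python
import re

# One-letter codes for the POS tags used by the hypothetical pattern below.
TAG_CODES = {"NOUN": "N", "PROPN": "N", "ADJ": "A", "ADP": "P"}

# NBAR: up to two adjectives/nouns followed by a noun.
NBAR = r"[AN]{0,2}N"
# NP: an NBAR, optionally followed by a preposition and another NBAR.
NP_PATTERN = re.compile(rf"{NBAR}(?:P{NBAR})?")


def select_noun_phrases(tagged_tokens):
    """Return the phrases whose POS sequence matches NP_PATTERN."""
    # Tags outside TAG_CODES map to "O", which never matches.
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged_tokens)
    words = [word for word, _ in tagged_tokens]
    return [
        " ".join(words[match.start():match.end()])
        for match in NP_PATTERN.finditer(codes)
    ]


tagged = [
    ("automatic", "ADJ"), ("extraction", "NOUN"), ("of", "ADP"),
    ("keyphrases", "NOUN"), ("is", "AUX"), ("hard", "ADJ"),
]
phrases = select_noun_phrases(tagged)
print(phrases)  # ['automatic extraction of keyphrases']
```

Because each token maps to exactly one character, match offsets in the code string are also token offsets, which keeps the lookup back to words simple.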
- candidate_weighting(model_file=None, df=None)¶
Extract features and classify candidates.
- Parameters
model_file (str) – path to the model file.
df (dict) – document frequencies; the number of documents should be specified using the "--NB_DOC--" key.
- feature_extraction(df=None, training=False, features_set=None)¶
Extract features for each candidate.
- Parameters
df (dict) – document frequencies; the number of documents should be specified using the "--NB_DOC--" key.
training (bool) – indicates whether features are computed for the training set when computing IDF weights, defaults to False.
features_set (list) – the set of features to use, defaults to [1, 4, 6].
- static train(training_instances, training_classes, model_file)¶
Train a Naive Bayes classifier and store the model in a file.
- Parameters
training_instances (list) – list of features.
training_classes (list) – list of binary values.
model_file (str) – the model output file.