Base classes

Base classes for the pke module.

class pke.base.LoadFile

The LoadFile class provides the base functions of the module: document loading, candidate selection and filtering, and n-best retrieval.

add_candidate(words, stems, pos, offset, sentence_id)

Add a keyphrase candidate to the candidates container.

Parameters
  • words (list) – the words (surface form) of the candidate.

  • stems (list) – the stemmed words of the candidate.

  • pos (list) – the Part-of-Speech tags of the words in the candidate.

  • offset (int) – the offset of the first word of the candidate.

  • sentence_id (int) – the sentence id of the candidate.
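
The bookkeeping described above can be sketched as follows. This is a minimal illustration, not the library's code: the Candidate dataclass is a hypothetical stand-in for pke.data_structures.Candidate, and keying the container by the space-joined stems is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical stand-in for pke.data_structures.Candidate."""
    surface_forms: list = field(default_factory=list)
    offsets: list = field(default_factory=list)
    pos_patterns: list = field(default_factory=list)
    sentence_ids: list = field(default_factory=list)
    lexical_form: list = field(default_factory=list)


# Container keyed by the space-joined stems (an assumption for this sketch).
candidates = defaultdict(Candidate)


def add_candidate(words, stems, pos, offset, sentence_id):
    """Record one occurrence of a candidate in the container."""
    entry = candidates[" ".join(stems)]
    entry.lexical_form = stems
    entry.surface_forms.append(words)
    entry.pos_patterns.append(pos)
    entry.offsets.append(offset)
    entry.sentence_ids.append(sentence_id)


# Two occurrences with the same stems end up in the same entry.
add_candidate(["neural", "networks"], ["neural", "network"], ["ADJ", "NOUN"], 4, 0)
add_candidate(["Neural", "network"], ["neural", "network"], ["ADJ", "NOUN"], 17, 2)
```

Grouping occurrences by stems means that inflected variants of a phrase share their offsets and sentence ids in a single entry.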

candidate_filtering(minimum_length=3, minimum_word_size=2, valid_punctuation_marks='-', maximum_word_number=5, only_alphanum=True, pos_blacklist=None)

Filter out the candidates that contain strings from the stoplist. Keep only the candidates made of alpha-numeric characters (if only_alphanum is set to True) and whose length reaches a given minimum number of characters.

Parameters
  • minimum_length (int) – minimum number of characters for a candidate, defaults to 3.

  • minimum_word_size (int) – minimum number of characters for a token to be considered as a valid word, defaults to 2.

  • valid_punctuation_marks (str) – punctuation marks that are valid for a candidate, defaults to '-'.

  • maximum_word_number (int) – maximum length in words of the candidate, defaults to 5.

  • only_alphanum (bool) – filter candidates containing non (latin) alpha-numeric characters, defaults to True.

  • pos_blacklist (list) – list of unwanted Part-of-Speech tags in candidates, defaults to None, which is treated as an empty list.
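
How these parameters might combine can be sketched as a single predicate. The function keep_candidate is a hypothetical helper, not part of pke, and the order of the checks and the ASCII-only reading of "alpha-numeric" are assumptions.

```python
import re


def keep_candidate(words, pos, stoplist=(), minimum_length=3,
                   minimum_word_size=2, valid_punctuation_marks="-",
                   maximum_word_number=5, only_alphanum=True,
                   pos_blacklist=None):
    """Return True if a candidate passes every filter (illustrative only)."""
    pos_blacklist = pos_blacklist or []
    lowered = [w.lower() for w in words]

    if not lowered:
        return False
    # Reject candidates containing a stopword or a blacklisted POS tag.
    if set(lowered) & set(stoplist):
        return False
    if set(pos) & set(pos_blacklist):
        return False
    # Reject candidates that are too short overall or contain a tiny token.
    if len("".join(lowered)) < minimum_length:
        return False
    if min(len(w) for w in lowered) < minimum_word_size:
        return False
    # Reject candidates with too many words.
    if len(lowered) > maximum_word_number:
        return False
    # Optionally require ASCII letters, digits and the allowed punctuation.
    if only_alphanum:
        allowed = re.compile(
            "[A-Za-z0-9" + re.escape(valid_punctuation_marks) + "]+")
        if not all(allowed.fullmatch(w) for w in lowered):
            return False
    return True


print(keep_candidate(["keyphrase", "extraction"], ["NOUN", "NOUN"], stoplist={"the"}))
print(keep_candidate(["the", "model"], ["DET", "NOUN"], stoplist={"the"}))
```

The first call prints True and the second prints False, because the second candidate contains a stopword.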

candidates

Keyphrase candidates container (dict of Candidate objects).

get_n_best(n=10, redundancy_removal=False, stemming=False)

Returns the n-best candidates given the weights.

Parameters
  • n (int) – the number of candidates, defaults to 10.

  • redundancy_removal (bool) – whether redundant keyphrases are filtered out from the n-best list, defaults to False.

  • stemming (bool) – whether to extract stems or surface forms (lowercased, first occurring form of candidate), defaults to False.
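
The ranking step can be sketched as a sort over the weight container followed by an optional redundancy pass. The function n_best and its is_redundant callback are hypothetical names for this sketch; the stemming option is left out.

```python
def n_best(weights, n=10, is_redundant=None):
    """Return the n highest-weighted (candidate, weight) pairs.

    `is_redundant(candidate, kept)` is an optional, caller-supplied test
    (hypothetical for this sketch) used to skip redundant candidates.
    """
    ranked = sorted(weights, key=weights.get, reverse=True)
    if is_redundant is not None:
        kept = []
        for cand in ranked:
            if not is_redundant(cand, kept):
                kept.append(cand)
        ranked = kept
    return [(cand, weights[cand]) for cand in ranked[:n]]


weights = {"neural network": 0.9, "network": 0.5, "keyphrase": 0.7}

# Toy redundancy test: a single word already present in a kept phrase.
toy_test = lambda cand, kept: any(cand in phrase.split() for phrase in kept)

print(n_best(weights, n=2))
print(n_best(weights, n=3, is_redundant=toy_test))
```

Both calls print [('neural network', 0.9), ('keyphrase', 0.7)]: the first because n is 2, the second because "network" is dropped as redundant.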

grammar_selection(grammar=None)

Select candidates using nltk RegexpParser with a grammar defining noun phrases (NP).

Parameters
  • grammar (str) – grammar defining POS patterns of NPs.
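
The idea of selecting candidates by a POS pattern can be illustrated without nltk. The sketch below is a stand-in that uses only the standard re module on a string of tags; it is not the RegexpParser grammar syntax, and the function name, the default pattern and the tag names are assumptions.

```python
import re


def select_by_pos_pattern(tagged, pattern=r"(?:<ADJ>)*(?:<NOUN>)+"):
    """Return the word sequences whose POS tags match `pattern`.

    `tagged` is a list of (word, tag) pairs. Tags are serialized as
    "<TAG>" so that an ordinary regular expression can scan them.
    """
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    phrases = []
    for match in re.finditer(pattern, tag_string):
        # Count opening brackets to map character offsets to token indices.
        first = tag_string.count("<", 0, match.start())
        last = tag_string.count("<", 0, match.end())
        phrases.append([word for word, _ in tagged[first:last]])
    return phrases


sentence = [("the", "DET"), ("deep", "ADJ"), ("network", "NOUN"),
            ("learns", "VERB"), ("features", "NOUN")]
print(select_by_pos_pattern(sentence))
```

This prints [['deep', 'network'], ['features']]: any number of adjectives followed by at least one noun.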

is_redundant(candidate, prev, minimum_length=1)

Test if one candidate is redundant with respect to a list of already selected candidates. A candidate is considered redundant if it is included in another candidate that is ranked higher in the list.

Parameters
  • candidate (str) – the lexical form of the candidate.

  • prev (list) – the list of already selected candidates (lexical forms).

  • minimum_length (int) – minimum length (in words) of the candidate to be considered, defaults to 1.
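
The inclusion test described above can be sketched on plain strings. This is an illustration under the assumption that "included" means "appears as a contiguous word sequence"; it splits on whitespace rather than using the container's stored forms.

```python
def is_redundant(candidate, prev, minimum_length=1):
    """Return True if `candidate` occurs as a contiguous word sequence
    inside one of the already selected candidates in `prev`."""
    cand = candidate.split()
    # Candidates shorter than minimum_length are never flagged.
    if len(cand) < minimum_length:
        return False
    for previous in prev:
        tokens = previous.split()
        for i in range(len(tokens) - len(cand) + 1):
            if tokens[i:i + len(cand)] == cand:
                return True
    return False


print(is_redundant("network", ["neural network"]))
print(is_redundant("net", ["neural network"]))
```

The first call prints True and the second prints False, since the comparison is made on whole words and not on substrings.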

language

Language of the input file.

load_document(input, language=None, stoplist=None, normalization='stemming', spacy_model=None)

Loads the content of a document/string/stream in a given language.

Parameters
  • input (str) – the input to load (document, string or stream).

  • language (str) – language of the input, defaults to 'en' if not given.

  • stoplist (list) – custom list of stopwords, defaults to pke.lang.stopwords[language].

  • normalization (str) – word normalization method, defaults to 'stemming'. The other possible value is 'none', which uses word surface forms instead of stems/lemmas.

  • spacy_model (spacy.lang) – preloaded spaCy model, used when input is a string.

longest_sequence_selection(key, valid_values)

Select the longest sequences of given POS tags as candidates.

Parameters
  • key (func) – function that, given a sentence, returns an iterable.

  • valid_values (set) – the set of valid values.
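
The selection of longest sequences can be sketched as a scan for maximal runs of valid values. The helper longest_sequences is a hypothetical name; it works on the iterable that key would return for one sentence, here a list of POS tags.

```python
def longest_sequences(values, valid_values):
    """Return (start, end) index pairs of the maximal runs of valid values."""
    spans, start = [], None
    for i, value in enumerate(values):
        if value in valid_values:
            if start is None:
                start = i          # a new run begins
        elif start is not None:
            spans.append((start, i))  # the current run ends
            start = None
    if start is not None:
        spans.append((start, len(values)))
    return spans


words = ["The", "deep", "neural", "network", "learns", "useful", "features"]
pos = ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADJ", "NOUN"]

spans = longest_sequences(pos, {"ADJ", "NOUN"})
phrases = [" ".join(words[start:end]) for start, end in spans]
print(phrases)
```

This prints ['deep neural network', 'useful features']. Each span is as long as possible, so "neural network" alone is not emitted.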

ngram_selection(n=3)

Select all the n-grams and populate the candidate container.

Parameters
  • n (int) – the n-gram length, defaults to 3.
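
A minimal sketch of n-gram enumeration for one sentence follows. It assumes that "all the n-grams" means every contiguous sequence of 1 to n words; the generator ngrams_up_to is a hypothetical helper, not a pke function.

```python
def ngrams_up_to(words, n=3):
    """Yield (offset, ngram) for every contiguous span of 1 to n words."""
    for start in range(len(words)):
        # The span may not run past the end of the sentence.
        for end in range(start + 1, min(start + n, len(words)) + 1):
            yield start, words[start:end]


sentence = ["graph", "based", "ranking"]
for offset, ngram in ngrams_up_to(sentence, n=2):
    print(offset, " ".join(ngram))
```

With n=2 this yields five spans: "graph", "graph based", "based", "based ranking" and "ranking". The offset of the first word is what a call to add_candidate would need.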

normalization

Word normalization method.

sentences

Sentence container (list of Sentence objects).

stoplist

List of stopwords.

weights

Weight container (can be either word or candidate weights).

class pke.data_structures.Candidate

The keyphrase candidate data structure.

lexical_form

The lexical form of the candidate.

offsets

The offsets of the surface forms.

pos_patterns

The Part-of-Speech patterns of the candidate.

sentence_ids

The sentence id of each surface form.

surface_forms

The surface forms of the candidate.

class pke.data_structures.Sentence(words, pos=[], meta={})

The sentence data structure.

length

Length (number of tokens) of the sentence.

meta

Meta-information of the sentence.

pos

List of Part-of-Speech tags.

stems

List of stems.

words

List of words (tokens) in the sentence.