Base classes

Base classes for the pke module.

class pke.base.LoadFile

The LoadFile class that provides base functions.

add_candidate(words, stems, pos, offset, sentence_id)

Add a keyphrase candidate to the candidates container.

Parameters

words (list) – the words (surface form) of the candidate.
stems (list) – the stemmed words of the candidate.
pos (list) – the part-of-speech tags of the words in the candidate.
offset (int) – the offset of the first word of the candidate.
sentence_id (int) – the sentence id of the candidate.
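As a rough illustration of how such a container can accumulate occurrences, the sketch below uses a stand-in `Candidate` dataclass and a free function rather than the actual method. Keying the container by the space-joined stems is an assumption made here for illustration, not something this page states.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Illustrative stand-in for pke.data_structures.Candidate."""
    surface_forms: list = field(default_factory=list)
    offsets: list = field(default_factory=list)
    pos_patterns: list = field(default_factory=list)
    sentence_ids: list = field(default_factory=list)
    lexical_form: list = field(default_factory=list)


def add_candidate(candidates, words, stems, pos, offset, sentence_id):
    """Record one occurrence of a candidate in the container."""
    # Assumption: occurrences sharing the same stems are the same candidate.
    key = " ".join(stems)
    entry = candidates[key]
    entry.surface_forms.append(words)
    entry.lexical_form = stems
    entry.pos_patterns.append(pos)
    entry.offsets.append(offset)
    entry.sentence_ids.append(sentence_id)


candidates = defaultdict(Candidate)
add_candidate(candidates, ["Neural", "networks"], ["neural", "network"],
              ["ADJ", "NOUN"], offset=0, sentence_id=0)
add_candidate(candidates, ["neural", "network"], ["neural", "network"],
              ["ADJ", "NOUN"], offset=12, sentence_id=1)
```

Both calls land in the same entry, so the entry ends up with two surface forms, two offsets and two sentence ids.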

candidate_filtering(minimum_length=3, minimum_word_size=2, valid_punctuation_marks='-', maximum_word_number=5, only_alphanum=True, pos_blacklist=None)

Filter out the candidates that contain strings from the stoplist. Keep only the candidates made of alpha-numeric characters (if only_alphanum is set to True) and whose length is at least a given number of characters.

Parameters

minimum_length (int) – minimum number of characters for a candidate, defaults to 3.
minimum_word_size (int) – minimum number of characters for a token to be considered as a valid word, defaults to 2.
valid_punctuation_marks (str) – punctuation marks that are valid for a candidate, defaults to '-'.
maximum_word_number (int) – maximum length in words of the candidate, defaults to 5.
only_alphanum (bool) – filter candidates containing non (latin) alpha-numeric characters, defaults to True.
pos_blacklist (list) – list of unwanted part-of-speech tags in candidates, defaults to None (no tags are blacklisted).
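The parameters above can be read as a chain of independent checks. The following is a minimal sketch of that reading, not the library's implementation: `keep_candidate` and `_is_alphanum` are hypothetical helpers, and treating "(latin) alpha-numeric" as ASCII letters and digits is an assumption.

```python
def _is_alphanum(word, valid_punctuation_marks):
    """True if the word is alpha-numeric once valid marks are ignored."""
    core = "".join(ch for ch in word if ch not in valid_punctuation_marks)
    # Assumption: ASCII letters and digits stand in for "(latin) alpha-numeric".
    return bool(core) and all(ch.isascii() and ch.isalnum() for ch in core)


def keep_candidate(words, pos, stoplist, minimum_length=3,
                   minimum_word_size=2, valid_punctuation_marks="-",
                   maximum_word_number=5, only_alphanum=True,
                   pos_blacklist=None):
    """Return True if a candidate passes every documented filter."""
    pos_blacklist = pos_blacklist or []
    if any(word.lower() in stoplist for word in words):
        return False  # contains a string from the stoplist
    if any(tag in pos_blacklist for tag in pos):
        return False  # contains an unwanted part-of-speech tag
    if len("".join(words)) < minimum_length:
        return False  # too few characters overall
    if min(len(word) for word in words) < minimum_word_size:
        return False  # at least one token is too short to be a word
    if len(words) > maximum_word_number:
        return False  # too many words
    if only_alphanum and not all(
            _is_alphanum(word, valid_punctuation_marks) for word in words):
        return False  # contains characters other than letters and digits
    return True


stoplist = {"the", "of"}
ok = keep_candidate(["keyphrase", "extraction"], ["NOUN", "NOUN"], stoplist)
rejected = keep_candidate(["the", "model"], ["DET", "NOUN"], stoplist)
```

With these inputs `ok` is True and `rejected` is False, because "the" is in the stoplist.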

candidates: Keyphrase candidates container (dict of Candidate objects).

get_n_best(n=10, redundancy_removal=False, stemming=False)

Returns the n-best candidates given the weights.

Parameters

n (int) – the number of candidates, defaults to 10.
redundancy_removal (bool) – whether redundant keyphrases are filtered out from the n-best list, defaults to False.
stemming (bool) – whether to extract stems or surface forms (lowercased, first occurring form of candidate), defaults to False.
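The ranking step can be sketched as a sort over the weight container followed by an optional redundancy pass. This is a simplified illustration under stated assumptions: `n_best` is a hypothetical free function, the `is_redundant` callable is supplied by the caller, and returning (candidate, weight) pairs is an assumption rather than something this page specifies.

```python
def n_best(weights, n=10, is_redundant=None):
    """Return the n highest-weighted (candidate, weight) pairs."""
    # Rank candidates by decreasing weight.
    ranked = sorted(weights, key=weights.get, reverse=True)
    if is_redundant is not None:
        kept = []
        for candidate in ranked:
            # Only compare against candidates that are ranked higher.
            if not is_redundant(candidate, kept):
                kept.append(candidate)
        ranked = kept
    return [(candidate, weights[candidate]) for candidate in ranked[:n]]


weights = {"neural network": 0.9, "network": 0.7, "training": 0.4}
top = n_best(weights, n=2)

# Toy redundancy test: a single word already present in a kept candidate.
def word_already_kept(candidate, kept):
    return any(candidate in other.split() for other in kept)

top_filtered = n_best(weights, n=2, is_redundant=word_already_kept)
```

Without filtering, the two best entries are "neural network" and "network"; with the toy redundancy test, "network" is dropped and "training" takes its place.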

grammar_selection(grammar=None)

Select candidates using nltk RegexpParser with a grammar defining noun phrases (NP).

Parameters: grammar (str) – grammar defining POS patterns of NPs.
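To show the idea of matching POS patterns without requiring nltk, the sketch below swaps RegexpParser for Python's `re` module: each tag is encoded as one letter and a regular expression is run over the resulting string. The `TAG_CODES` mapping, the `noun_phrase_spans` helper and the pattern (zero or more adjectives followed by one or more nouns) are illustrative assumptions, not the library's grammar handling.

```python
import re

# Hypothetical one-letter codes so a tag sequence becomes a plain string.
TAG_CODES = {"ADJ": "A", "NOUN": "N", "PROPN": "N"}


def noun_phrase_spans(pos_tags, pattern=r"A*N+"):
    """Return (start, end) token spans whose tags match the pattern."""
    # Any tag without a code is encoded as "O" (other).
    encoded = "".join(TAG_CODES.get(tag, "O") for tag in pos_tags)
    # One character per token, so match spans are token spans.
    return [match.span() for match in re.finditer(pattern, encoded)]


words = ["The", "quick", "brown", "fox", "jumps", "over", "lazy", "dogs"]
pos = ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP", "ADJ", "NOUN"]
phrases = [" ".join(words[start:end])
           for start, end in noun_phrase_spans(pos)]
```

Here `phrases` holds "quick brown fox" and "lazy dogs", the two spans whose tags fit the pattern.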

is_redundant(candidate, prev, minimum_length=1)

Test if one candidate is redundant with respect to a list of already selected candidates. A candidate is considered redundant if it is included in another candidate that is ranked higher in the list.

Parameters

candidate (str) – the lexical form of the candidate.
prev (list) – the list of already selected candidates (lexical forms).
minimum_length (int) – minimum length (in words) of the candidate to be considered, defaults to 1.
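The inclusion test described above can be sketched as a search for the candidate's words as a contiguous run inside each previously selected candidate. This standalone function is an illustration, not the method itself; it assumes lexical forms are space-separated strings and that candidates shorter than minimum_length are never flagged as redundant.

```python
def is_redundant(candidate, prev, minimum_length=1):
    """Return True if candidate occurs inside an already selected one."""
    tokens = candidate.split()
    # Assumption: candidates shorter than minimum_length are never flagged.
    if len(tokens) < minimum_length:
        return False
    for other in prev:
        other_tokens = other.split()
        # Slide a window of len(tokens) words over the other candidate.
        for start in range(len(other_tokens) - len(tokens) + 1):
            if other_tokens[start:start + len(tokens)] == tokens:
                return True
    return False


selected = ["neural network", "keyphrase extraction"]
redundant = is_redundant("network", selected)
distinct = is_redundant("network training", selected)
```

The comparison is word-based, so "work" would not be considered part of "neural network" even though it is a substring of it.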

language: Language of the input file.

load_document(input, language=None, stoplist=None, normalization='stemming', spacy_model=None)

Loads the content of a document/string/stream in a given language.

Parameters

input (str) – the input to load (a document, string, or stream).
language (str) – language of the input, defaults to ‘en’.
stoplist (list) – custom list of stopwords, defaults to pke.lang.stopwords[language].
normalization (str) – word normalization method, defaults to ‘stemming’. The other possible value is ‘none’, which uses word surface forms instead of stems/lemmas.
spacy_model (spacy.lang) – preloaded spacy model when input is a string.

longest_sequence_selection(key, valid_values)

Select the longest sequences of given POS tags as candidates.

Parameters

key (func) – function that, given a sentence, returns an iterable.
valid_values (set) – the set of valid values.
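One way to picture this selection is as a scan for maximal runs of consecutive tokens whose key value is in valid_values. The sketch below is a simplified illustration: the dict-based sentence, the `longest_sequences` helper and the example tag set are all assumptions made for the example.

```python
def longest_sequences(words, values, valid_values):
    """Return (offset, words) for each maximal run of valid values."""
    runs = []
    start = None  # index where the current run began, if any
    for index, value in enumerate(values):
        if value in valid_values:
            if start is None:
                start = index
        elif start is not None:
            # The run just ended: store it and reset.
            runs.append((start, words[start:index]))
            start = None
    if start is not None:
        # The sentence ended while a run was still open.
        runs.append((start, words[start:]))
    return runs


# Hypothetical sentence representation and key function.
sentence = {
    "words": ["The", "quick", "brown", "fox", "jumps", "over", "lazy", "dogs"],
    "pos": ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP", "ADJ", "NOUN"],
}
key = lambda s: s["pos"]
runs = longest_sequences(sentence["words"], key(sentence), {"ADJ", "NOUN"})
```

With this input, `runs` contains "quick brown fox" at offset 1 and "lazy dogs" at offset 6.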

ngram_selection(n=3)

Select all the n-grams and populate the candidate container.

Parameters: n (int) – the n-gram length, defaults to 3.
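A minimal sketch of n-gram enumeration over one tokenized sentence follows. It assumes that n is an upper bound, so every sequence of 1 to n words is produced; this page only says "the n-gram length", so that reading is an assumption, and the `ngrams` generator is a hypothetical helper.

```python
def ngrams(words, n=3):
    """Yield (offset, n-gram) pairs for every n-gram of 1 to n words."""
    for start in range(len(words)):
        # The end index is capped by both n and the sentence length.
        for end in range(start + 1, min(start + n, len(words)) + 1):
            yield start, words[start:end]


grams = list(ngrams(["graph", "based", "ranking"], n=2))
```

For three words and n=2 this produces five n-grams: the three single words plus "graph based" and "based ranking".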

normalization: Word normalization method.

sentences: Sentence container (list of Sentence objects).

stoplist: List of stopwords.

weights: Weight container (can be either word or candidate weights).

class pke.data_structures.Candidate

The keyphrase candidate data structure.

lexical_form: the lexical form of the candidate.

offsets: the offsets of the surface forms.

pos_patterns: the part-of-speech patterns of the candidate.

sentence_ids: the sentence id of each surface form.

surface_forms: the surface forms of the candidate.

class pke.data_structures.Sentence(words, pos=[], meta={})

The sentence data structure.

length: length (number of tokens) of the sentence.

meta: meta-information of the sentence.

pos: list of part-of-speech tags.

stems: list of stems.

words: list of words (tokens) in the sentence.
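The attributes listed above can be pictured as a small record type. The dataclass below is an illustrative stand-in, not the library's class: deriving `length` as a property and using default factories (so that each instance gets its own `pos`, `meta` and `stems` containers) are choices made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    """Illustrative stand-in for pke.data_structures.Sentence."""
    words: list
    pos: list = field(default_factory=list)
    meta: dict = field(default_factory=dict)
    stems: list = field(default_factory=list)

    @property
    def length(self):
        # Number of tokens in the sentence.
        return len(self.words)


first = Sentence(["Keyphrase", "extraction"], pos=["NOUN", "NOUN"])
second = Sentence(["Graphs"])
first.meta["source"] = "example"
```

Because each instance receives fresh containers, writing to `first.meta` leaves `second.meta` empty.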
