Base classes¶
Base classes for the pke module.
- class pke.base.LoadFile¶
The LoadFile class that provides base functions.
- add_candidate(words, stems, pos, offset, sentence_id)¶
Add a keyphrase candidate to the candidates container.
- Parameters
words (list) – the words (surface form) of the candidate.
stems (list) – the stemmed words of the candidate.
pos (list) – the part-of-speech tags of the words in the candidate.
offset (int) – the offset of the first word of the candidate.
sentence_id (int) – the sentence id of the candidate.
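The following is a minimal sketch of how such a container could accumulate occurrences; it is not pke's actual implementation. The field names mirror the Candidate attributes documented on this page, while the function name `record_occurrence` and the choice of the space-joined stems as the dictionary key are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Illustrative stand-in for the Candidate data structure."""
    surface_forms: list = field(default_factory=list)
    offsets: list = field(default_factory=list)
    sentence_ids: list = field(default_factory=list)
    pos_patterns: list = field(default_factory=list)
    lexical_form: list = field(default_factory=list)


def record_occurrence(candidates, words, stems, pos, offset, sentence_id):
    """Record one occurrence of a candidate in the container."""
    # Assumption: occurrences are grouped under the space-joined stems.
    key = " ".join(stems)
    candidate = candidates[key]
    candidate.surface_forms.append(words)
    candidate.lexical_form = stems
    candidate.pos_patterns.append(pos)
    candidate.offsets.append(offset)
    candidate.sentence_ids.append(sentence_id)


candidates = defaultdict(Candidate)
record_occurrence(candidates, ["Neural", "networks"], ["neural", "network"],
                  ["ADJ", "NOUN"], offset=0, sentence_id=0)
record_occurrence(candidates, ["neural", "network"], ["neural", "network"],
                  ["ADJ", "NOUN"], offset=12, sentence_id=1)
```

Under this keying, two surface variants that share the same stems end up in a single entry with two offsets and two sentence ids.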
- candidate_filtering(minimum_length=3, minimum_word_size=2, valid_punctuation_marks='-', maximum_word_number=5, only_alphanum=True, pos_blacklist=None)¶
Filter out the candidates that contain strings from the stoplist. Keep only the candidates that consist of alpha-numeric characters (if only_alphanum is set to True) and whose length is at least a given number of characters.
- Parameters
minimum_length (int) – minimum number of characters for a candidate, defaults to 3.
minimum_word_size (int) – minimum number of characters for a token to be considered as a valid word, defaults to 2.
valid_punctuation_marks (str) – punctuation marks that are valid for a candidate, defaults to ‘-’.
maximum_word_number (int) – maximum length in words of the candidate, defaults to 5.
only_alphanum (bool) – filter candidates containing non (latin) alpha-numeric characters, defaults to True.
pos_blacklist (list) – list of unwanted part-of-speech tags in candidates, defaults to None (no tags are excluded).
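As a rough illustration of how these parameters could interact, the sketch below checks a single candidate against each filter in turn. It is an assumption-laden approximation, not pke's code: the function name `passes_filters`, the argument layout, and the exact regular expression used for the alpha-numeric test are all hypothetical.

```python
import re


def passes_filters(words, pos_tags, stoplist,
                   minimum_length=3, minimum_word_size=2,
                   valid_punctuation_marks="-", maximum_word_number=5,
                   only_alphanum=True, pos_blacklist=None):
    """Return True if a candidate passes every filter (illustrative only)."""
    pos_blacklist = pos_blacklist or []
    if not words:
        return False
    # Drop candidates that contain a stopword.
    if any(word.lower() in stoplist for word in words):
        return False
    # Drop candidates that contain an unwanted part-of-speech tag.
    if any(tag in pos_blacklist for tag in pos_tags):
        return False
    # Drop candidates with too few characters overall.
    if len("".join(words)) < minimum_length:
        return False
    # Drop candidates whose shortest word is too short.
    if min(len(word) for word in words) < minimum_word_size:
        return False
    # Drop candidates with too many words.
    if len(words) > maximum_word_number:
        return False
    # Optionally require Latin letters, digits and the allowed punctuation.
    if only_alphanum:
        allowed = "[A-Za-z0-9" + re.escape(valid_punctuation_marks) + "]+"
        if not all(re.fullmatch(allowed, word) for word in words):
            return False
    return True


stoplist = {"the", "of"}
kept = passes_filters(["keyphrase", "extraction"], ["NOUN", "NOUN"], stoplist)
dropped = passes_filters(["the", "model"], ["DET", "NOUN"], stoplist)
```

Here `kept` is True and `dropped` is False, because the second candidate contains a stopword.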
- candidates¶
Keyphrase candidates container (dict of Candidate objects).
- get_n_best(n=10, redundancy_removal=False, stemming=False)¶
Returns the n-best candidates given the weights.
- Parameters
n (int) – the number of candidates, defaults to 10.
redundancy_removal (bool) – whether redundant keyphrases are filtered out from the n-best list, defaults to False.
stemming (bool) – whether to extract stems or surface forms (lowercased, first occurring form of candidate), defaults to False.
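The core ranking step can be pictured as a sort over the weights container. The sketch below assumes, for illustration only, that weights are stored in a dict mapping a candidate string to a float and that the result is a list of (candidate, weight) pairs; the helper name `n_best` is hypothetical, and redundancy removal and stemming are left out.

```python
def n_best(weights, n=10):
    """Return the n highest-weighted (candidate, weight) pairs."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


weights = {"neural network": 0.42, "network": 0.17, "training data": 0.31}
top_two = n_best(weights, n=2)
```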
- grammar_selection(grammar=None)¶
Select candidates using nltk RegexpParser with a grammar defining noun phrases (NP).
- Parameters
grammar (str) – grammar defining POS patterns of NPs.
- is_redundant(candidate, prev, minimum_length=1)¶
Test if one candidate is redundant with respect to a list of already selected candidates. A candidate is considered redundant if it is included in another candidate that is ranked higher in the list.
- Parameters
candidate (str) – the lexical form of the candidate.
prev (list) – the list of already selected candidates (lexical forms).
minimum_length (int) – minimum length (in words) of the candidate to be considered, defaults to 1.
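One way to read the inclusion test is as a word-level sublist check, sketched below. This is an interpretation of the description above rather than pke's own code: the function name `is_included` is hypothetical, and skipping candidates shorter than minimum_length words is an assumption about how that parameter is used.

```python
def is_included(candidate, prev, minimum_length=1):
    """Return True if the candidate's words occur contiguously in a previous one."""
    words = candidate.split()
    # Assumption: candidates shorter than minimum_length words are never flagged.
    if len(words) < minimum_length:
        return False
    span = len(words)
    for selected in prev:
        selected_words = selected.split()
        # Slide a window of the candidate's length over the selected candidate.
        for start in range(len(selected_words) - span + 1):
            if selected_words[start:start + span] == words:
                return True
    return False


selected = ["deep neural network"]
redundant = is_included("neural network", selected)
distinct = is_included("network analysis", selected)
```

Because the comparison is made on whole words, a substring such as "net" is not treated as included in "deep neural network".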
- language¶
Language of the input file.
- load_document(input, language=None, stoplist=None, normalization='stemming', spacy_model=None)¶
Loads the content of a document/string/stream in a given language.
- Parameters
input (str) – the input, given as a document, a string or a stream.
language (str) – language of the input, defaults to ‘en’.
stoplist (list) – custom list of stopwords, defaults to pke.lang.stopwords[language].
normalization (str) – word normalization method, defaults to ‘stemming’. The other possible value is ‘none’, which uses word surface forms instead of stems/lemmas.
spacy_model (spacy.lang) – preloaded spacy model when input is a string.
- longest_sequence_selection(key, valid_values)¶
Select the longest sequences of given POS tags as candidates.
- Parameters
key (func) – function that, given a sentence, returns an iterable of values to check against valid_values.
valid_values (set) – the set of valid values.
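The idea of keeping the longest runs of valid tags can be sketched with itertools.groupby. The example below is only an approximation under stated assumptions: the sentence is represented as a plain dict with hypothetical "words" and "pos" keys, and the helper name `longest_runs` is not part of pke.

```python
from itertools import groupby


def longest_runs(sentence, key, valid_values):
    """Yield (start, words) for each maximal run whose key values are valid."""
    values = list(key(sentence))
    position = 0
    # Group consecutive tokens by whether their key value is valid.
    for is_valid, group in groupby(values, key=lambda value: value in valid_values):
        length = len(list(group))
        if is_valid:
            yield position, sentence["words"][position:position + length]
        position += length


sentence = {
    "words": ["the", "deep", "neural", "network", "learns", "useful", "features"],
    "pos": ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADJ", "NOUN"],
}
runs = list(longest_runs(sentence, key=lambda s: s["pos"],
                         valid_values={"ADJ", "NOUN"}))
```

With the adjective and noun tags as valid values, the sentence yields two runs: "deep neural network" starting at position 1 and "useful features" starting at position 5.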
- ngram_selection(n=3)¶
Select all the n-grams and populate the candidate container.
- Parameters
n (int) – the n-gram length, defaults to 3.
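A simple way to enumerate n-grams within one sentence is shown below. The sketch assumes that n acts as an upper bound, so that every contiguous sequence of 1 to n words is collected; the description above does not state this explicitly, and the helper name `ngrams_up_to` is hypothetical.

```python
def ngrams_up_to(words, n=3):
    """Return (offset, ngram) pairs for every contiguous run of 1..n words."""
    selected = []
    for start in range(len(words)):
        # Assumption: n is an upper bound on the n-gram length.
        for end in range(start + 1, min(start + n, len(words)) + 1):
            selected.append((start, words[start:end]))
    return selected


pairs = ngrams_up_to(["graph", "based", "ranking"], n=2)
```

Each pair keeps the offset of the n-gram's first word, which matches the kind of information add_candidate expects.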
- normalization¶
Word normalization method.
- sentences¶
Sentence container (list of Sentence objects).
- stoplist¶
List of stopwords.
- weights¶
Weight container (can be either word or candidate weights).
- class pke.data_structures.Candidate¶
The keyphrase candidate data structure.
- lexical_form¶
The lexical form of the candidate.
- offsets¶
The offsets of the surface forms.
- pos_patterns¶
The part-of-speech patterns of the candidate.
- sentence_ids¶
The sentence id of each surface form.
- surface_forms¶
The surface forms of the candidate.