Unsupervised models¶
Statistical models¶
TfIdf¶
- class pke.unsupervised.TfIdf¶
TF*IDF keyphrase extraction model.
Parameterized example:
import string
import pke

# 1. create a TfIdf extractor.
extractor = pke.unsupervised.TfIdf()

# 2. load the content of the document.
stoplist = list(string.punctuation)
stoplist += pke.lang.stopwords.get('en')
extractor.load_document(input='path/to/input',
                        language='en',
                        stoplist=stoplist,
                        normalization=None)

# 3. select {1-3}-grams not containing punctuation marks as candidates.
extractor.candidate_selection(n=3)

# 4. weight the candidates using a `tf` x `idf`
df = pke.load_document_frequency_file(input_file='path/to/df.tsv.gz')
extractor.candidate_weighting(df=df)

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)
- candidate_selection(n=3)¶
Select 1-3 grams as keyphrase candidates.
- Parameters
n (int) – the length of the n-grams, defaults to 3.
- candidate_weighting(df=None)¶
Candidate weighting function using document frequencies.
- Parameters
df (dict) – document frequencies, the number of documents should be specified using the "--NB_DOC--" key.
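To make the weighting concrete, here is a minimal, self-contained sketch of the tf x idf scheme (illustrative only: the dictionary contents, the add-one smoothing, and the base-2 logarithm are assumptions, not pke's exact internals):

import math

# hypothetical document frequencies; "--NB_DOC--" holds the corpus size
df = {'--NB_DOC--': 100, 'keyphrase extraction': 18, 'random walk': 3}

# assumed term frequencies observed in the current document
tf = {'keyphrase extraction': 4, 'random walk': 2}

N = df['--NB_DOC--']
weights = {}
for candidate, frequency in tf.items():
    # smooth unseen candidates with an add-one document frequency
    idf = math.log(N / (1 + df.get(candidate, 0)), 2)
    weights[candidate] = frequency * idf

# highest-scored candidates first
print(sorted(weights.items(), key=lambda x: -x[1]))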
KPMiner¶
- class pke.unsupervised.KPMiner¶
KP-Miner keyphrase extraction model.
Parameterized example:
import pke

# 1. create a KPMiner extractor.
extractor = pke.unsupervised.KPMiner()

# 2. load the content of the document.
extractor.load_document(input='path/to/input',
                        language='en',
                        normalization=None)

# 3. select {1-5}-grams that do not contain punctuation marks or
#    stopwords as keyphrase candidates. Set the least allowable seen
#    frequency to 5 and the number of words after which candidates are
#    filtered out to 200.
lasf = 5
cutoff = 200
extractor.candidate_selection(lasf=lasf, cutoff=cutoff)

# 4. weight the candidates using KPMiner weighting function.
df = pke.load_document_frequency_file(input_file='path/to/df.tsv.gz')
alpha = 2.3
sigma = 3.0
extractor.candidate_weighting(df=df, alpha=alpha, sigma=sigma)

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)
- candidate_selection(lasf=3, cutoff=400)¶
The candidate selection as described in the KP-Miner paper.
- Parameters
lasf (int) – least allowable seen frequency, defaults to 3.
cutoff (int) – the number of words after which candidates are filtered out, defaults to 400.
stoplist (list) – the stoplist for filtering candidates, defaults to the nltk stoplist. Words that are punctuation marks from string.punctuation are not allowed.
- candidate_weighting(df=None, sigma=3.0, alpha=2.3)¶
Candidate weight calculation as described in the KP-Miner paper.
Note
w = tf * idf * B * P_f, with
B = min(sigma, N_d / (P_d * alpha)), where
N_d = the number of all candidate terms,
P_d = the number of candidates whose length exceeds one,
P_f = 1.
- Parameters
df (dict) – document frequencies, the number of documents should be specified using the "--NB_DOC--" key.
sigma (float) – parameter for the boosting factor, defaults to 3.0.
alpha (float) – parameter for the boosting factor, defaults to 2.3.
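For illustration, a short worked example of the boosting factor defined above, using assumed document statistics:

# assumed statistics for a single document
N_d = 120    # number of all candidate terms
P_d = 30     # number of candidates whose length exceeds one
alpha = 2.3
sigma = 3.0

# boosting factor, capped at sigma
B = min(sigma, N_d / (P_d * alpha))  # 120 / 69, about 1.74

# final weight of a multi-word candidate (P_f = 1), with assumed tf and idf
tf, idf = 4, 2.5
w = tf * idf * B * 1
print(round(B, 2), round(w, 2))  # 1.74 17.39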
YAKE¶
- class pke.unsupervised.YAKE¶
YAKE keyphrase extraction model.
Parameterized example:
import pke
from pke.lang import stopwords

# 1. create a YAKE extractor.
extractor = pke.unsupervised.YAKE()

# 2. load the content of the document.
stoplist = stopwords.get('en')
extractor.load_document(input='path/to/input',
                        language='en',
                        stoplist=stoplist,
                        normalization=None)

# 3. select {1-3}-grams not containing punctuation marks and not
#    beginning/ending with a stopword as candidates.
extractor.candidate_selection(n=3)

# 4. weight the candidates using YAKE weighting scheme, a window (in
#    words) for computing left/right contexts can be specified.
window = 2
use_stems = False  # use stems instead of words for weighting
extractor.candidate_weighting(window=window, use_stems=use_stems)

# 5. get the 10-highest scored candidates as keyphrases.
#    redundant keyphrases are removed from the output using Levenshtein
#    similarity and a threshold.
threshold = 0.8
keyphrases = extractor.get_n_best(n=10, threshold=threshold)
- candidate_selection(n=3)¶
Select 1-3 grams as keyphrase candidates. Candidates beginning or ending with a stopword are filtered out. Words that do not contain at least one alpha-numeric character are not allowed.
- Parameters
n (int) – the n-gram length, defaults to 3.
- candidate_weighting(window=2, use_stems=False)¶
Candidate weight calculation as described in the YAKE paper.
- Parameters
use_stems (bool) – whether to use stems instead of lowercase words for weighting, defaults to False.
window (int) – the size in words of the window used for computing co-occurrence counts, defaults to 2.
- contexts¶
Container for word contexts.
- features¶
Container for word features.
- get_n_best(n=10, redundancy_removal=True, stemming=False, threshold=0.8)¶
Returns the n-best candidates given the weights.
- Parameters
n (int) – the number of candidates, defaults to 10.
redundancy_removal (bool) – whether redundant keyphrases are filtered out from the n-best list using normalized Levenshtein similarity, defaults to True.
stemming (bool) – whether to extract stems or surface forms (lowercased, first occurring form of candidate), defaults to False.
threshold (float) – the threshold used when comparing candidates for redundancy, defaults to 0.8.
- is_redundant(candidate, prev, threshold=0.8)¶
Test if one candidate is redundant with respect to a list of already selected candidates. A candidate is considered redundant if its normalized Levenshtein similarity with a candidate that is ranked higher in the list is greater than a threshold.
- Parameters
candidate (str) – the lexical form of the candidate.
prev (list) – the list of already selected candidates.
threshold (float) – the threshold used when comparing candidates for redundancy, defaults to 0.8.
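The test can be approximated with the standard library; this sketch uses difflib's similarity ratio in place of pke's normalized Levenshtein computation, so the exact scores will differ:

from difflib import SequenceMatcher

def is_redundant(candidate, prev, threshold=0.8):
    # redundant if the similarity with any higher-ranked candidate
    # exceeds the threshold
    return any(SequenceMatcher(None, candidate, p).ratio() > threshold
               for p in prev)

selected = []
for candidate in ['machine learning', 'machine learnings', 'neural network']:
    if not is_redundant(candidate, selected):
        selected.append(candidate)

print(selected)  # ['machine learning', 'neural network']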
- surface_to_lexical¶
Mapping from surface form to lexical form.
- words¶
Container for the vocabulary.
Graph-based models¶
TextRank¶
- class pke.unsupervised.TextRank¶
TextRank for keyword extraction.
This model builds a graph that represents the text. A graph-based ranking algorithm is then applied to extract the lexical units (here the words) that are most important in the text.
In this implementation, nodes are words with certain part-of-speech tags (nouns and adjectives) and edges represent co-occurrence relations, controlled by the distance between word occurrences (here a window of 2 words). Nodes are ranked by the TextRank graph-based ranking algorithm in its unweighted variant.
Parameterized example:
import pke

# define the set of valid Part-of-Speech tags
pos = {'NOUN', 'PROPN', 'ADJ'}

# 1. create a TextRank extractor.
extractor = pke.unsupervised.TextRank()

# 2. load the content of the document.
extractor.load_document(input='path/to/input',
                        language='en',
                        normalization=None)

# 3. build the graph representation of the document and rank the words.
#    Keyphrase candidates are composed from the 33-percent
#    highest-ranked words.
extractor.candidate_weighting(window=2,
                              pos=pos,
                              top_percent=0.33)

# 4. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)
- build_word_graph(window=2, pos=None)¶
Build a graph representation of the document in which nodes/vertices are words and edges represent co-occurrence relations. Syntactic filters can be applied to select only words with certain part-of-speech tags. Co-occurrence relations can be controlled using the distance between word occurrences in the document.
As the original paper does not give precise details on how the word graph is constructed, we make the following assumptions based on the example given in Figure 2: 1) sentence boundaries are not taken into account, and 2) stopwords and punctuation marks are considered as words when computing the window.
- Parameters
window (int) – the window for connecting two words in the graph, defaults to 2.
pos (set) – the set of valid pos for words to be considered as nodes in the graph, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).
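Under these assumptions, the construction can be sketched with networkx (a simplified stand-in for pke's implementation; the sample words and tags are made up):

import networkx as nx

words = ['compatibility', 'of', 'systems', 'of', 'linear', 'constraints']
tags = ['NOUN', 'ADP', 'NOUN', 'ADP', 'ADJ', 'NOUN']
valid = {'NOUN', 'PROPN', 'ADJ'}
window = 2

G = nx.Graph()
G.add_nodes_from(w for w, t in zip(words, tags) if t in valid)

# connect two valid words co-occurring within the window; stopwords and
# punctuation are not nodes but still count toward the distance
for i, (w1, t1) in enumerate(zip(words, tags)):
    if t1 not in valid:
        continue
    for j in range(i + 1, min(i + window, len(words))):
        if tags[j] in valid:
            G.add_edge(w1, words[j])

# the unweighted variant of TextRank is plain PageRank over this graph
scores = nx.pagerank(G)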
- candidate_selection(pos=None)¶
Candidate selection using the longest sequences of words with valid PoS tags.
- Parameters
pos (set) – set of valid POS tags, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).
- candidate_weighting(window=2, pos=None, top_percent=None, normalized=False)¶
Tailored candidate ranking method for TextRank. Keyphrase candidates are either composed from the T-percent highest-ranked words, as in the original paper, or extracted using the candidate_selection() method. Candidates are ranked using the sum of the scores of their words, optionally normalized by candidate length.
- Parameters
window (int) – the window for connecting two words in the graph, defaults to 2.
pos (set) – the set of valid pos for words to be considered as nodes in the graph, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).
top_percent (float) – percentage of top vertices to keep for phrase generation.
normalized (bool) – whether to normalize keyphrase scores by their length, defaults to False.
- graph¶
The word graph.
SingleRank¶
- class pke.unsupervised.SingleRank¶
SingleRank keyphrase extraction model.
This model is an extension of the TextRank model that uses the number of co-occurrences to weigh edges in the graph.
Parameterized example:
import pke

# define the set of valid Part-of-Speech tags
pos = {'NOUN', 'PROPN', 'ADJ'}

# 1. create a SingleRank extractor.
extractor = pke.unsupervised.SingleRank()

# 2. load the content of the document.
extractor.load_document(input='path/to/input',
                        language='en',
                        normalization=None)

# 3. select the longest sequences of nouns and adjectives as candidates.
extractor.candidate_selection(pos=pos)

# 4. weight the candidates using the sum of their words' scores, which are
#    computed using random walk. In the graph, nodes are words of
#    certain part-of-speech (nouns and adjectives) that are connected if
#    they occur in a window of 10 words.
extractor.candidate_weighting(window=10, pos=pos)

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)
- build_word_graph(window=10, pos=None)¶
Build a graph representation of the document in which nodes/vertices are words and edges represent co-occurrence relations. Syntactic filters can be applied to select only words with certain part-of-speech tags. Co-occurrence relations can be controlled using the distance (window) between word occurrences in the document.
The number of times two words co-occur in a window is encoded as the edge weight. Sentence boundaries are not taken into account in the window.
- Parameters
window (int) – the window for connecting two words in the graph, defaults to 10.
pos (set) – the set of valid pos for words to be considered as nodes in the graph, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).
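The only structural difference from TextRank's graph is the edge weights; a sketch of the accumulation step (add_cooccurrence is a hypothetical helper, not pke's API):

import networkx as nx

def add_cooccurrence(G, w1, w2):
    # each co-occurrence within the window increments the edge weight
    if G.has_edge(w1, w2):
        G[w1][w2]['weight'] += 1
    else:
        G.add_edge(w1, w2, weight=1)

G = nx.Graph()
add_cooccurrence(G, 'linear', 'constraints')
add_cooccurrence(G, 'linear', 'constraints')

# the random walk then takes the accumulated counts into account
scores = nx.pagerank(G, weight='weight')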
- candidate_weighting(window=10, pos=None, normalized=False)¶
Keyphrase candidate ranking using the weighted variant of the TextRank formula. Candidates are scored by the sum of the scores of their words.
- Parameters
window (int) – the window within the sentence for connecting two words in the graph, defaults to 10.
pos (set) – the set of valid pos for words to be considered as nodes in the graph, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).
normalized (bool) – whether to normalize keyphrase scores by their length, defaults to False.
TopicRank¶
- class pke.unsupervised.TopicRank¶
TopicRank keyphrase extraction model.
Parameterized example:
import pke
import string

# 1. create a TopicRank extractor.
extractor = pke.unsupervised.TopicRank()

# 2. load the content of the document.
stoplist = list(string.punctuation)
stoplist += pke.lang.stopwords.get('en')
extractor.load_document(input='path/to/input.xml',
                        stoplist=stoplist)

# 3. select the longest sequences of nouns and adjectives that do
#    not contain punctuation marks or stopwords as candidates.
pos = {'NOUN', 'PROPN', 'ADJ'}
extractor.candidate_selection(pos=pos)

# 4. build topics by grouping candidates with HAC (average linkage,
#    threshold of 1/4 of shared stems). Weight the topics using random
#    walk, and select the first occurring candidate from each topic.
extractor.candidate_weighting(threshold=0.74, method='average')

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)
- build_topic_graph()¶
Build topic graph.
- candidate_selection(pos=None)¶
Select the longest sequences of nouns and adjectives as keyphrase candidates.
- Parameters
pos (set) – the set of valid POS tags, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).
- candidate_weighting(threshold=0.74, method='average', heuristic=None)¶
Candidate ranking using random walk.
- Parameters
threshold (float) – the minimum similarity for clustering, defaults to 0.74.
method (str) – the linkage method, defaults to average.
heuristic (str) – the heuristic for selecting the best candidate for each topic, defaults to first occurring candidate. Other options are ‘frequent’ (most frequent candidate, position is used for ties).
- graph¶
The topic graph.
- topic_clustering(threshold=0.74, method='average')¶
Clustering candidates into topics.
- Parameters
threshold (float) – the minimum similarity for clustering, defaults to 0.74, i.e. more than 1/4 of stem overlap similarity.
method (str) – the linkage method, defaults to average.
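A sketch of this step with scipy, assuming binary bag-of-stems vectors and Jaccard distances (close to, but not necessarily identical to, pke's vectorization). Cutting the dendrogram at a distance of 0.74 keeps together candidates whose similarity exceeds 0.26, i.e. roughly 1/4 of stem overlap:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# assumed binary bag-of-stems vectors, one row per candidate
X = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 0, 1]])

# average-linkage HAC over pairwise Jaccard distances
Z = linkage(pdist(X, metric='jaccard'), method='average')

# cut the dendrogram at the distance threshold; candidates sharing
# enough stems end up in the same topic
topics = fcluster(Z, t=0.74, criterion='distance')
print(topics)  # e.g. [1 1 2]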
- topics¶
The topic container.
- vectorize_candidates()¶
Vectorize the keyphrase candidates.
- Returns
C (list) – the list of candidates; X (matrix) – the vectorized representation of the candidates.
- Return type
tuple
TopicalPageRank¶
- class pke.unsupervised.TopicalPageRank¶
Single TopicalPageRank keyphrase extraction model.
Parameterized example:
import pke

# define the set of valid Part-of-Speech tags to occur in the graph
pos = {'NOUN', 'PROPN', 'ADJ'}

# define the grammar for selecting the keyphrase candidates
grammar = "NP: {<ADJ>*<NOUN|PROPN>+}"

# 1. create a TopicalPageRank extractor.
extractor = pke.unsupervised.TopicalPageRank()

# 2. load the content of the document.
extractor.load_document(input='path/to/input',
                        language='en',
                        normalization=None)

# 3. select the noun phrases as keyphrase candidates.
extractor.candidate_selection(grammar=grammar)

# 4. weight the keyphrase candidates using Single Topical PageRank.
#    Builds a word graph in which edges connecting two words occurring
#    in a window are weighted by co-occurrence counts.
extractor.candidate_weighting(window=10,
                              pos=pos,
                              lda_model='path/to/lda_model')

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)
- candidate_selection(grammar=None)¶
Candidate selection heuristic.
Here we select noun phrases that match the regular expression (adjective)*(noun)+, which represents zero or more adjectives followed by one or more nouns (Liu et al., 2010).
Note that the Single TPR paper gives no details on candidate selection; the only information that can be found is the following:
… a set of expressions or noun phrases …
… Adjectives and nouns are then merged into keyphrases and corresponding scores are summed and ranked. …
- Parameters
grammar (str) – grammar defining POS patterns of NPs, defaults to “NP: {<ADJ>*<NOUN|PROPN>+}”.
- candidate_weighting(window=10, pos=None, lda_model=None, normalized=False)¶
Candidate weight calculation using a biased PageRank towards LDA topic distributions.
- Parameters
window (int) – the window within the sentence for connecting two words in the graph, defaults to 10.
pos (set) – the set of valid pos for words to be considered as nodes in the graph, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).
lda_model (str) – path to an LDA model produced by sklearn, stored in gzip-compressed pickle format.
normalized (bool) – whether to normalize keyphrase scores by their length, defaults to False.
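The topical bias enters the random walk as a personalization vector over words; a minimal sketch with networkx (the graph and the word relevance values are assumed here, they would normally be derived from the LDA model):

import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([('topic', 'model', 2),
                           ('model', 'inference', 1),
                           ('topic', 'inference', 1)])

# assumed relevance of each word to the document's topic distribution
relevance = {'topic': 0.6, 'model': 0.3, 'inference': 0.1}

# PageRank biased towards topically relevant words
scores = nx.pagerank(G, personalization=relevance, weight='weight')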
PositionRank¶
- class pke.unsupervised.PositionRank¶
PositionRank keyphrase extraction model.
Parameterized example:
import pke

# define the set of valid Part-of-Speech tags to occur in the graph
pos = {'NOUN', 'PROPN', 'ADJ'}

# define the grammar for selecting the keyphrase candidates
grammar = "NP: {<ADJ>*<NOUN|PROPN>+}"

# 1. create a PositionRank extractor.
extractor = pke.unsupervised.PositionRank()

# 2. load the content of the document.
extractor.load_document(input='path/to/input',
                        language='en',
                        normalization=None)

# 3. select the noun phrases of up to 3 words as keyphrase candidates.
extractor.candidate_selection(grammar=grammar,
                              maximum_word_number=3)

# 4. weight the candidates using the sum of their words' scores, which are
#    computed using a random walk biased with the position of the words
#    in the document. In the graph, nodes are words (nouns and
#    adjectives only) that are connected if they occur in a window of
#    10 words.
extractor.candidate_weighting(window=10, pos=pos)

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)
- build_word_graph(window=10, pos=None)¶
Build the graph representation of the document.
In the graph, nodes are words that pass a part-of-speech filter. Two nodes are connected if the words corresponding to these nodes co-occur within a window of contiguous tokens. The weight of an edge is computed from the co-occurrence count of the two words within that window.
- Parameters
window (int) – the window within the sentence for connecting two words in the graph, defaults to 10.
pos (set) – the set of valid pos for words to be considered as nodes in the graph, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).
- candidate_selection(grammar=None, maximum_word_number=3)¶
Candidate selection heuristic using a syntactic PoS pattern for noun phrase extraction.
Keyphrase candidates are noun phrases that match the regular expression (adjective)*(noun)+, of length up to three.
- Parameters
grammar (str) – grammar defining POS patterns of NPs, defaults to “NP: {<ADJ>*<NOUN|PROPN>+}”.
maximum_word_number (int) – the maximum number of words allowed for keyphrase candidates, defaults to 3.
- candidate_weighting(window=10, pos=None, normalized=False)¶
Candidate weight calculation using a biased PageRank.
- Parameters
window (int) – the window within the sentence for connecting two words in the graph, defaults to 10.
pos (set) – the set of valid pos for words to be considered as nodes in the graph, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).
normalized (bool) – whether to normalize keyphrase scores by their length, defaults to False.
- positions¶
Container for the sums of words' inverse positions.
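A sketch of how the position bias can be computed and plugged into the random walk (the offsets and the graph are assumed; pke derives them from the document):

import networkx as nx

# assumed 0-indexed offsets of each word's occurrences in the document
offsets = {'models': [2, 15, 40], 'graph': [7], 'ranking': [10, 22]}

# each word is weighted by the sum of the inverses of its (1-indexed)
# positions, and the weights are normalized into a distribution
sums = {w: sum(1 / (p + 1) for p in offs) for w, offs in offsets.items()}
total = sum(sums.values())
personalization = {w: s / total for w, s in sums.items()}

G = nx.Graph()
G.add_weighted_edges_from([('models', 'graph', 2),
                           ('graph', 'ranking', 1),
                           ('models', 'ranking', 1)])

# PageRank biased towards words appearing early in the document
scores = nx.pagerank(G, personalization=personalization, weight='weight')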
MultipartiteRank¶
- class pke.unsupervised.MultipartiteRank¶
Multipartite graph keyphrase extraction model.
Parameterized example:
import pke
import string

# 1. create a MultipartiteRank extractor.
extractor = pke.unsupervised.MultipartiteRank()

stoplist = list(string.punctuation)
stoplist += pke.lang.stopwords.get('en')

# 2. load the content of the document.
extractor.load_document(input='path/to/input.xml',
                        stoplist=stoplist)

# 3. select the longest sequences of nouns and adjectives, that do
#    not contain punctuation marks or stopwords as candidates.
pos = {'NOUN', 'PROPN', 'ADJ'}
extractor.candidate_selection(pos=pos)

# 4. build the Multipartite graph and rank candidates using random
#    walk, alpha controls the weight adjustment mechanism, see
#    TopicRank for threshold/method parameters.
extractor.candidate_weighting(alpha=1.1,
                              threshold=0.74,
                              method='average')

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)
- build_topic_graph()¶
Build the Multipartite graph.
- candidate_weighting(threshold=0.74, method='average', alpha=1.1)¶
Candidate weight calculation using random walk.
- Parameters
threshold (float) – the minimum similarity for clustering, defaults to 0.74.
method (str) – the linkage method, defaults to average.
alpha (float) – hyper-parameter that controls the strength of the weight adjustment, defaults to 1.1.
- graph¶
Redefine the graph as a directed graph.
- topic_clustering(threshold=0.74, method='average')¶
Clustering candidates into topics.
- Parameters
threshold (float) – the minimum similarity for clustering, defaults to 0.74, i.e. more than 1/4 of stem overlap similarity.
method (str) – the linkage method, defaults to average.
- topic_identifiers¶
A container for linking candidates to topic identifiers.
- weight_adjustment(alpha=1.1)¶
Adjust edge weights for boosting some candidates.
- Parameters
alpha (float) – hyper-parameter that controls the strength of the weight adjustment, defaults to 1.1.
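A simplified sketch of the adjustment, based on a loose reading of the mechanism described in the MultipartiteRank paper (the topics, positions, and graph are assumed inputs, and pke's exact bookkeeping differs):

import math
import networkx as nx

def weight_adjustment(G, topics, position, alpha=1.1):
    # boost edges incoming to the first occurring candidate of each
    # topic, in proportion to alpha, to its position, and to the weights
    # linking the source node to the other candidates of the topic
    for candidates in topics:
        first = min(candidates, key=position.get)
        others = [c for c in candidates if c != first]
        for u in list(G.predecessors(first)):
            boost = sum(G[u][c]['weight']
                        for c in others if G.has_edge(u, c))
            G[u][first]['weight'] += alpha * math.exp(1 / (1 + position[first])) * boost

G = nx.DiGraph()
G.add_weighted_edges_from([('cat', 'dog', 1.0), ('cat', 'puppy', 2.0),
                           ('dog', 'cat', 1.0), ('puppy', 'cat', 1.0)])
weight_adjustment(G, topics=[{'dog', 'puppy'}],
                  position={'dog': 4, 'puppy': 9, 'cat': 0})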