Unsupervised models

Statistical models

TfIdf

class pke.unsupervised.TfIdf

TF*IDF keyphrase extraction model.

Parameterized example:

import string
import pke

# 1. create a TfIdf extractor.
extractor = pke.unsupervised.TfIdf()

# 2. load the content of the document.
stoplist = list(string.punctuation)
stoplist += pke.lang.stopwords.get('en')
extractor.load_document(input='path/to/input',
                        language='en',
                        stoplist=stoplist,
                        normalization=None)

# 3. select {1-3}-grams not containing punctuation marks as candidates.
extractor.candidate_selection(n=3)

# 4. weight the candidates using a `tf` x `idf` weighting scheme.
df = pke.load_document_frequency_file(input_file='path/to/df.tsv.gz')
extractor.candidate_weighting(df=df)

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

candidate_selection(n=3)

Select 1-3 grams as keyphrase candidates.

Parameters

n (int) – the length of the n-grams, defaults to 3.

candidate_weighting(df=None)

Candidate weighting function using document frequencies.

Parameters

df (dict) – document frequencies, the number of documents should be specified using the "--NB_DOC--" key.
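
For illustration, the expected format of the document-frequency dict can be sketched by hand; the counts below are made up, and the "--NB_DOC--" key holds the total number of documents:

# toy document-frequency dict (all counts illustrative)
df = {'--NB_DOC--': 100,
      'keyphrase extraction': 12,
      'model': 57}
extractor.candidate_weighting(df=df)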

KPMiner

class pke.unsupervised.KPMiner

KP-Miner keyphrase extraction model.

Parameterized example:

import pke

# 1. create a KPMiner extractor.
extractor = pke.unsupervised.KPMiner()

# 2. load the content of the document.
extractor.load_document(input='path/to/input',
                        language='en',
                        normalization=None)


# 3. select {1-5}-grams that do not contain punctuation marks or
#    stopwords as keyphrase candidates. Set the least allowable seen
#    frequency to 5 and the number of words after which candidates are
#    filtered out to 200.
lasf = 5
cutoff = 200
extractor.candidate_selection(lasf=lasf, cutoff=cutoff)

# 4. weight the candidates using KPMiner weighting function.
df = pke.load_document_frequency_file(input_file='path/to/df.tsv.gz')
alpha = 2.3
sigma = 3.0
extractor.candidate_weighting(df=df, alpha=alpha, sigma=sigma)

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

candidate_selection(lasf=3, cutoff=400)

The candidate selection as described in the KP-Miner paper.

Parameters
  • lasf (int) – least allowable seen frequency, defaults to 3.

  • cutoff (int) – the number of words after which candidates are filtered out, defaults to 400.

  • stoplist (list) – the stoplist for filtering candidates, defaults to the nltk stoplist. Punctuation marks from string.punctuation are always filtered out.

candidate_weighting(df=None, sigma=3.0, alpha=2.3)

Candidate weight calculation as described in the KP-Miner paper.

Note

w = tf * idf * B * P_f with

  • B = N_d / (P_d * alpha) and B = min(sigma, B)

  • N_d = the number of all candidate terms

  • P_d = number of candidates whose length exceeds one

  • P_f = 1
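
A toy computation of the boosting factor B under these definitions (all numbers illustrative):

# toy computation of the KP-Miner boosting factor B
N_d = 80                              # number of all candidate terms
P_d = 25                              # number of candidates longer than one word
alpha, sigma = 2.3, 3.0
B = min(sigma, N_d / (P_d * alpha))   # = min(3.0, 1.391...) ~ 1.39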

Parameters
  • df (dict) – document frequencies, the number of documents should be specified using the "--NB_DOC--" key.

  • sigma (float) – parameter for boosting factor, defaults to 3.0.

  • alpha (float) – parameter for boosting factor, defaults to 2.3.

YAKE

class pke.unsupervised.YAKE

YAKE keyphrase extraction model.

Parameterized example:

import pke
from pke.lang import stopwords

# 1. create a YAKE extractor.
extractor = pke.unsupervised.YAKE()

# 2. load the content of the document.
stoplist = stopwords.get('en')
extractor.load_document(input='path/to/input',
                        language='en',
                        stoplist=stoplist,
                        normalization=None)


# 3. select {1-3}-grams not containing punctuation marks and not
#    beginning/ending with a stopword as candidates.
extractor.candidate_selection(n=3)

# 4. weight the candidates using YAKE weighting scheme, a window (in
#    words) for computing left/right contexts can be specified.
window = 2
use_stems = False # use stems instead of words for weighting
extractor.candidate_weighting(window=window,
                              use_stems=use_stems)

# 5. get the 10-highest scored candidates as keyphrases.
#    redundant keyphrases are removed from the output using levenshtein
#    distance and a threshold.
threshold = 0.8
keyphrases = extractor.get_n_best(n=10, threshold=threshold)

candidate_selection(n=3)

Select 1-3 grams as keyphrase candidates. Candidates beginning or ending with a stopword are filtered out. Words that do not contain at least one alpha-numeric character are not allowed.

Parameters

n (int) – the n-gram length, defaults to 3.

candidate_weighting(window=2, use_stems=False)

Candidate weight calculation as described in the YAKE paper.

Parameters
  • use_stems (bool) – whether to use stems instead of lowercase words for weighting, defaults to False.

  • window (int) – the size in words of the window used for computing co-occurrence counts, defaults to 2.

contexts

Container for word contexts.

features

Container for word features.

get_n_best(n=10, redundancy_removal=True, stemming=False, threshold=0.8)

Returns the n-best candidates given the weights.

Parameters
  • n (int) – the number of candidates, defaults to 10.

  • redundancy_removal (bool) – whether redundant keyphrases are filtered out from the n-best list using levenshtein distance, defaults to True.

  • stemming (bool) – whether to extract stems or surface forms (lowercased, first occurring form of candidate), defaults to False (surface forms).

  • threshold (float) – the threshold used when computing the levenshtein distance, defaults to 0.8.
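
For reference, the returned value is a list of (candidate, score) tuples; a minimal usage sketch (scores are illustrative):

for keyphrase, score in extractor.get_n_best(n=3):
    print(keyphrase, score)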

is_redundant(candidate, prev, threshold=0.8)

Test if one candidate is redundant with respect to a list of already selected candidates. A candidate is considered redundant if its normalized Levenshtein similarity with a candidate ranked higher in the list exceeds the threshold.

Parameters
  • candidate (str) – the lexical form of the candidate.

  • prev (list) – the list of already selected candidates.

  • threshold (float) – the threshold used when computing the levenshtein distance, defaults to 0.8.
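
A minimal sketch of this criterion using a normalized Levenshtein similarity; this is a toy re-implementation for illustration, not pke's internal code:

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def is_redundant(candidate, prev, threshold=0.8):
    for other in prev:
        dist = edit_distance(candidate, other) / max(len(candidate), len(other))
        if 1.0 - dist > threshold:  # high similarity means redundant
            return True
    return False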

surface_to_lexical

Mapping from surface form to lexical form.

words

Container for the vocabulary.

Graph-based models

TextRank

class pke.unsupervised.TextRank

TextRank for keyword extraction.

This model builds a graph that represents the text. A graph-based ranking algorithm is then applied to extract the lexical units (here the words) that are most important in the text.

In this implementation, nodes are words of certain part-of-speech (nouns and adjectives) and edges represent co-occurrence relation, controlled by the distance between word occurrences (here a window of 2 words). Nodes are ranked by the TextRank graph-based ranking algorithm in its unweighted variant.

Parameterized example:

import pke

# define the set of valid Part-of-Speeches
pos = {'NOUN', 'PROPN', 'ADJ'}

# 1. create a TextRank extractor.
extractor = pke.unsupervised.TextRank()

# 2. load the content of the document.
extractor.load_document(input='path/to/input',
                        language='en',
                        normalization=None)

# 3. build the graph representation of the document and rank the words.
#    Keyphrase candidates are composed from the 33-percent
#    highest-ranked words.
extractor.candidate_weighting(window=2,
                              pos=pos,
                              top_percent=0.33)

# 4. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

build_word_graph(window=2, pos=None)

Build a graph representation of the document in which nodes/vertices are words and edges represent co-occurrence relation. Syntactic filters can be applied to select only words of certain Part-of-Speech. Co-occurrence relations can be controlled using the distance between word occurrences in the document.

As the original paper does not give precise details on how the word graph is constructed, we make the following assumptions from the example given in Figure 2: 1) sentence boundaries are not taken into account and, 2) stopwords and punctuation marks are considered as words when computing the window.

Parameters
  • window (int) – the window for connecting two words in the graph, defaults to 2.

  • pos (set) – the set of valid pos for words to be considered as nodes in the graph, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).
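
A minimal sketch of this construction with networkx (which pke uses for its graphs); tokens and window are toy values, and the Part-of-Speech filter is omitted for brevity:

import networkx as nx

tokens = ['compatibility', 'of', 'systems', 'of', 'linear', 'constraints']
window = 2
G = nx.Graph()
for i, u in enumerate(tokens):
    for j in range(i + 1, min(i + window, len(tokens))):
        if u != tokens[j]:
            G.add_edge(u, tokens[j])  # unweighted co-occurrence edge
word_scores = nx.pagerank(G)          # ranking is PageRank over this graph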

candidate_selection(pos=None)

Candidate selection using longest sequences of PoS.

Parameters

pos (set) – set of valid POS tags, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).

candidate_weighting(window=2, pos=None, top_percent=None, normalized=False)

Tailored candidate ranking method for TextRank. Keyphrase candidates are either composed from the T-percent highest-ranked words, as in the original paper, or extracted using the candidate_selection() method. Candidates are ranked using the sum of their words' scores, optionally normalized by candidate length (see the normalized parameter).

Parameters
  • window (int) – the window for connecting two words in the graph, defaults to 2.

  • pos (set) – the set of valid pos for words to be considered as nodes in the graph, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).

  • top_percent (float) – percentage of top vertices to keep for phrase generation.

  • normalized (bool) – whether to normalize keyphrase scores by their length, defaults to False.

graph

The word graph.

SingleRank

class pke.unsupervised.SingleRank

SingleRank keyphrase extraction model.

This model is an extension of the TextRank model that uses the number of co-occurrences to weigh edges in the graph.

Parameterized example:

import pke

# define the set of valid Part-of-Speeches
pos = {'NOUN', 'PROPN', 'ADJ'}

# 1. create a SingleRank extractor.
extractor = pke.unsupervised.SingleRank()

# 2. load the content of the document.
extractor.load_document(input='path/to/input',
                        language='en',
                        normalization=None)

# 3. select the longest sequences of nouns and adjectives as candidates.
extractor.candidate_selection(pos=pos)

# 4. weight the candidates using the sum of their word's scores that are
#    computed using random walk. In the graph, nodes are words of
#    certain part-of-speech (nouns and adjectives) that are connected if
#    they occur in a window of 10 words.
extractor.candidate_weighting(window=10,
                              pos=pos)

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

build_word_graph(window=10, pos=None)

Build a graph representation of the document in which nodes/vertices are words and edges represent co-occurrence relation. Syntactic filters can be applied to select only words of certain Part-of-Speech. Co-occurrence relations can be controlled using the distance (window) between word occurrences in the document.

The number of times two words co-occur in a window is encoded as edge weights. Sentence boundaries are not taken into account in the window.

Parameters
  • window (int) – the window for connecting two words in the graph, defaults to 10.

  • pos (set) – the set of valid pos for words to be considered as nodes in the graph, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).

candidate_weighting(window=10, pos=None, normalized=False)

Keyphrase candidate ranking using the weighted variant of the TextRank formulae. Candidates are scored by the sum of the scores of their words.

Parameters
  • window (int) – the window within the sentence for connecting two words in the graph, defaults to 10.

  • pos (set) – the set of valid pos for words to be considered as nodes in the graph, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).

  • normalized (bool) – whether to normalize keyphrase scores by their length, defaults to False.
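
A sketch of the weighted variant, with edge weights standing for co-occurrence counts and a candidate scored by summing its words' scores (toy graph, illustrative values):

import networkx as nx

G = nx.Graph()
G.add_edge('linear', 'systems', weight=3)    # co-occurred 3 times in windows
G.add_edge('linear', 'equations', weight=2)
G.add_edge('systems', 'equations', weight=1)
word_scores = nx.pagerank(G, weight='weight')

candidate_words = ['linear', 'systems']      # words of one candidate phrase
score = sum(word_scores[w] for w in candidate_words)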

TopicRank

class pke.unsupervised.TopicRank

TopicRank keyphrase extraction model.

Parameterized example:

import pke
import string

# 1. create a TopicRank extractor.
extractor = pke.unsupervised.TopicRank()

# 2. load the content of the document.
stoplist = list(string.punctuation)
stoplist += pke.lang.stopwords.get('en')
extractor.load_document(input='path/to/input.xml',
                        stoplist=stoplist)

# 3. select the longest sequences of nouns and adjectives, that do
#    not contain punctuation marks or stopwords as candidates.
pos = {'NOUN', 'PROPN', 'ADJ'}
extractor.candidate_selection(pos=pos)

# 4. build topics by grouping candidates with HAC (average linkage,
#    threshold of 1/4 of shared stems). Weight the topics using random
#    walk, and select the first occurring candidate from each topic.
extractor.candidate_weighting(threshold=0.74, method='average')

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

build_topic_graph()

Build topic graph.

candidate_selection(pos=None)

Selects longest sequences of nouns and adjectives as keyphrase candidates.

Parameters

pos (set) – the set of valid POS tags, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).

candidate_weighting(threshold=0.74, method='average', heuristic=None)

Candidate ranking using random walk.

Parameters
  • threshold (float) – the minimum similarity for clustering, defaults to 0.74.

  • method (str) – the linkage method, defaults to average.

  • heuristic (str) – the heuristic for selecting the best candidate for each topic, defaults to the first occurring candidate. The other option is ‘frequent’ (most frequent candidate, with position used to break ties).

graph

The topic graph.

topic_clustering(threshold=0.74, method='average')

Clustering candidates into topics.

Parameters
  • threshold (float) – the minimum similarity for clustering, defaults to 0.74, i.e. more than 1/4 of stem overlap similarity.

  • method (str) – the linkage method, defaults to average.
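
A sketch of this clustering step consistent with the description: average-linkage HAC over binary stem-occurrence vectors, cut at the distance threshold (toy vectors; pke's exact feature construction may differ in detail):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

X = np.array([[1, 1, 0, 0],   # one row per candidate, one column per stem
              [1, 1, 0, 1],
              [0, 0, 1, 1]])
Y = pdist(X, metric='jaccard')       # pairwise candidate distances
Z = linkage(Y, method='average')     # average-linkage HAC
topics = fcluster(Z, t=0.74, criterion='distance')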

topics

The topic container.

vectorize_candidates()

Vectorize the keyphrase candidates.

Returns
  • C (list) – the list of candidates.

  • X (matrix) – vectorized representation of the candidates.

TopicalPageRank

class pke.unsupervised.TopicalPageRank

Single TopicalPageRank keyphrase extraction model.

Parameterized example:

import pke

# define the valid Part-of-Speeches to occur in the graph
pos = {'NOUN', 'PROPN', 'ADJ'}

# define the grammar for selecting the keyphrase candidates
grammar = "NP: {<ADJ>*<NOUN|PROPN>+}"

# 1. create a TopicalPageRank extractor.
extractor = pke.unsupervised.TopicalPageRank()

# 2. load the content of the document.
extractor.load_document(input='path/to/input',
                        language='en',
                        normalization=None)

# 3. select the noun phrases as keyphrase candidates.
extractor.candidate_selection(grammar=grammar)

# 4. weight the keyphrase candidates using Single Topical PageRank.
#    Builds a word-graph in which edges connecting two words occurring
#    in a window are weighted by co-occurrence counts.
extractor.candidate_weighting(window=10,
                              pos=pos,
                              lda_model='path/to/lda_model')

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

candidate_selection(grammar=None)

Candidate selection heuristic.

Here we select noun phrases that match the regular expression (adjective)*(noun)+, which represents zero or more adjectives followed by one or more nouns (Liu et al., 2010).

Note that the Single TPR paper gives no details on candidate selection; the only information that can be found is:

… a set of expressions or noun phrases …

… Adjectives and nouns are then merged into keyphrases and corresponding scores are summed and ranked. …

Parameters

grammar (str) – grammar defining POS patterns of NPs, defaults to “NP: {<ADJ>*<NOUN|PROPN>+}”.
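
To illustrate what this pattern matches, the same grammar can be run through nltk's RegexpParser over toy POS-tagged tokens (pke applies an equivalent pattern internally):

import nltk

grammar = "NP: {<ADJ>*<NOUN|PROPN>+}"
parser = nltk.RegexpParser(grammar)
tagged = [('efficient', 'ADJ'), ('keyphrase', 'NOUN'), ('extraction', 'NOUN'),
          ('is', 'VERB'), ('hard', 'ADJ')]
tree = parser.parse(tagged)  # one NP chunk: 'efficient keyphrase extraction'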

candidate_weighting(window=10, pos=None, lda_model=None, normalized=False)

Candidate weight calculation using a biased PageRank towards LDA topic distributions.

Parameters
  • window (int) – the window within the sentence for connecting two words in the graph, defaults to 10.

  • pos (set) – the set of valid pos for words to be considered as nodes in the graph, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).

  • lda_model (pickle.gz) – an LDA model produced by sklearn, in gzip-compressed pickle format.

  • normalized (bool) – whether to normalize keyphrase scores by their length, defaults to False.
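
The topic bias can be pictured as a personalized PageRank; a minimal sketch with networkx, in which the per-word bias values stand in for LDA-derived topical weights (all values illustrative):

import networkx as nx

G = nx.Graph()
G.add_edge('topic', 'model', weight=2)
G.add_edge('model', 'inference', weight=1)
bias = {'topic': 0.5, 'model': 0.3, 'inference': 0.2}  # stand-in for LDA weights
word_scores = nx.pagerank(G, personalization=bias, weight='weight')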

PositionRank

class pke.unsupervised.PositionRank

PositionRank keyphrase extraction model.

Parameterized example:

import pke

# define the valid Part-of-Speeches to occur in the graph
pos = {'NOUN', 'PROPN', 'ADJ'}

# define the grammar for selecting the keyphrase candidates
grammar = "NP: {<ADJ>*<NOUN|PROPN>+}"

# 1. create a PositionRank extractor.
extractor = pke.unsupervised.PositionRank()

# 2. load the content of the document.
extractor.load_document(input='path/to/input',
                        language='en',
                        normalization=None)

# 3. select the noun phrases up to 3 words as keyphrase candidates.
extractor.candidate_selection(grammar=grammar,
                              maximum_word_number=3)

# 4. weight the candidates using the sum of their word's scores that are
#    computed using random walk biased with the position of the words
#    in the document. In the graph, nodes are words (nouns and
#    adjectives only) that are connected if they occur in a window of
#    10 words.
extractor.candidate_weighting(window=10,
                              pos=pos)

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

build_word_graph(window=10, pos=None)

Build the graph representation of the document.

In the graph, nodes are words that pass a Part-of-Speech filter. Two nodes are connected if the corresponding words co-occur within a window of contiguous tokens, and the weight of an edge is the co-occurrence count of the two words within that window.

Parameters
  • window (int) – the window within the sentence for connecting two words in the graph, defaults to 10.

  • pos (set) – the set of valid pos for words to be considered as nodes in the graph, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).

candidate_selection(grammar=None, maximum_word_number=3)

Candidate selection heuristic using a syntactic PoS pattern for noun phrase extraction.

Keyphrase candidates are noun phrases that match the regular expression (adjective)*(noun)+, of length up to three.

Parameters
  • grammar (str) – grammar defining POS patterns of NPs, defaults to “NP: {<ADJ>*<NOUN|PROPN>+}”.

  • maximum_word_number (int) – the maximum number of words allowed for keyphrase candidates, defaults to 3.

candidate_weighting(window=10, pos=None, normalized=False)

Candidate weight calculation using a biased PageRank.

Parameters
  • window (int) – the window within the sentence for connecting two words in the graph, defaults to 10.

  • pos (set) – the set of valid pos for words to be considered as nodes in the graph, defaults to (‘NOUN’, ‘PROPN’, ‘ADJ’).

  • normalized (bool) – whether to normalize keyphrase scores by their length, defaults to False.

positions

Container for the sums of words’ inverse positions.
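
A worked example of the position bias: each word is weighted by the sum of the inverses of its occurrence positions, and the vector is normalized before biasing the random walk (toy, 1-indexed offsets):

occurrences = {'models': [1, 5, 10], 'graph': [3]}
bias = {w: sum(1.0 / p for p in ps) for w, ps in occurrences.items()}
# {'models': 1.3, 'graph': 0.333...}
total = sum(bias.values())
bias = {w: b / total for w, b in bias.items()}  # normalized bias vector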

MultipartiteRank

class pke.unsupervised.MultipartiteRank

Multipartite graph keyphrase extraction model.

Parameterized example:

import pke
import string

# 1. create a MultipartiteRank extractor.
extractor = pke.unsupervised.MultipartiteRank()

stoplist = list(string.punctuation)
stoplist += pke.lang.stopwords.get('en')

# 2. load the content of the document.
extractor.load_document(input='path/to/input.xml',
                        stoplist=stoplist)

# 3. select the longest sequences of nouns and adjectives, that do
#    not contain punctuation marks or stopwords as candidates.
pos = {'NOUN', 'PROPN', 'ADJ'}
extractor.candidate_selection(pos=pos)

# 4. build the Multipartite graph and rank candidates using random
#    walk, alpha controls the weight adjustment mechanism, see
#    TopicRank for threshold/method parameters.
extractor.candidate_weighting(alpha=1.1,
                              threshold=0.74,
                              method='average')

# 5. get the 10-highest scored candidates as keyphrases
keyphrases = extractor.get_n_best(n=10)

build_topic_graph()

Build the Multipartite graph.

candidate_weighting(threshold=0.74, method='average', alpha=1.1)

Candidate weight calculation using random walk.

Parameters
  • threshold (float) – the minimum similarity for clustering, defaults to 0.74.

  • method (str) – the linkage method, defaults to average.

  • alpha (float) – hyper-parameter that controls the strength of the weight adjustment, defaults to 1.1.

graph

Redefine the graph as a directed graph.

topic_clustering(threshold=0.74, method='average')

Clustering candidates into topics.

Parameters
  • threshold (float) – the minimum similarity for clustering, defaults to 0.74, i.e. more than 1/4 of stem overlap similarity.

  • method (str) – the linkage method, defaults to average.

topic_identifiers

A container for linking candidates to topic identifiers.

weight_adjustment(alpha=1.1)

Adjust edge weights for boosting some candidates.

Parameters

alpha (float) – hyper-parameter that controls the strength of the weight adjustment, defaults to 1.1.
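
As a rough illustration, the adjustment favors topics whose first candidate occurs early: a factor of the form alpha * e^(1/p), with p the first-occurrence offset, shrinks quickly as p grows. The full edge-weight update is defined in the MultipartiteRank paper and is more involved than this factor alone (numbers illustrative):

import math

alpha = 1.1
for p in (1, 10, 100):
    print(p, round(alpha * math.exp(1.0 / p), 3))
# 1 -> 2.99, 10 -> 1.216, 100 -> 1.111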