Useful functions
Useful functions for the pke module.
- pke.utils.compute_document_frequency(documents, output_file, language='en', stoplist=None, normalization='stemming', delimiter='\t', n=3)
Compute the n-gram document frequencies from a set of input documents. An extra row is added to the output file to record the number of documents from which the document frequencies were computed (--NB_DOC-- tab XXX). The output file is compressed using gzip.
- Parameters
documents (list) – list of pke-readable documents.
output_file (str) – the output file.
language (str) – language of the input documents (used for computing the n-stem or n-lemma forms), defaults to ‘en’ (English).
stoplist (list) – the stop words for filtering n-grams, defaults to pke.lang.stopwords[language].
normalization (str) – word normalization method, defaults to ‘stemming’. Other possible value is ‘none’ for using word surface forms instead of stems/lemmas.
delimiter (str) – the delimiter between n-grams and document frequencies, defaults to a tab character (‘\t’).
n (int) – the size of the n-grams, defaults to 3.
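The output layout described above can be illustrated with a minimal standard-library sketch. It is not pke's implementation: the helper names `ngrams` and `write_document_frequencies` are hypothetical, the documents are assumed to be already tokenized, no stemming or stop-word filtering is applied, and it assumes that all n-grams of length 1 to n are counted.

```python
import gzip
import os
import tempfile
from collections import Counter


def ngrams(tokens, n=3):
    """Yield every 1- to n-gram of a token list as a space-joined string."""
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            yield " ".join(tokens[start:start + size])


def write_document_frequencies(tokenized_docs, output_file, delimiter="\t", n=3):
    """Count in how many documents each n-gram occurs and write a gzip file."""
    frequencies = Counter()
    for tokens in tokenized_docs:
        # A set makes each n-gram count at most once per document.
        frequencies.update(set(ngrams(tokens, n)))
    with gzip.open(output_file, "wt", encoding="utf-8") as f:
        # Extra row holding the number of documents.
        f.write("--NB_DOC--" + delimiter + str(len(tokenized_docs)) + "\n")
        for term, freq in sorted(frequencies.items()):
            f.write(term + delimiter + str(freq) + "\n")
    return frequencies


# Toy, already-tokenized documents (invented for illustration).
docs = [["graph", "based", "ranking"], ["graph", "ranking"]]
path = os.path.join(tempfile.mkdtemp(), "df.tsv.gz")
df = write_document_frequencies(docs, path)
```

Here "graph" occurs in both toy documents, so its document frequency is 2, while "graph ranking" occurs only in the second one.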
- pke.utils.compute_lda_model(documents, output_file, n_topics=500, language='en', stoplist=None, normalization='stemming')
Compute an LDA model from a collection of documents. Latent Dirichlet Allocation is computed using the sklearn module.
- Parameters
documents (list) – list of pke-readable documents.
output_file (str) – the output file.
n_topics (int) – number of topics for the LDA model, defaults to 500.
language (str) – language of the input documents, used for stop_words in sklearn CountVectorizer, defaults to ‘en’.
stoplist (list) – the stop words for filtering words, defaults to pke.lang.stopwords[language].
normalization (str) – word normalization method, defaults to ‘stemming’. Other possible value is ‘none’ for using word surface forms instead of stems/lemmas.
- pke.utils.load_document_frequency_file(input_file, delimiter='\t')
Load a tsv (tab-separated-values) file containing document frequencies. Automatically detects if input file is compressed (gzip) by looking at its extension (.gz).
- Parameters
input_file (str) – the input file containing document frequencies in tsv format.
delimiter (str) – the delimiter used for separating term-document frequency tuples, defaults to ‘\t’.
- Returns
a dictionary of the form {term_1: freq}, freq being an integer.
- Return type
dict
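As a rough illustration of the loading behaviour described above, the following standard-library sketch reads a delimiter-separated file into a dictionary and switches to gzip when the file name ends in .gz. The function name `load_frequencies` is hypothetical and the sample file is invented; this is not pke's code.

```python
import gzip
import os
import tempfile


def load_frequencies(input_file, delimiter="\t"):
    """Read term/frequency rows into a dict, handling .gz by extension."""
    opener = gzip.open if input_file.endswith(".gz") else open
    frequencies = {}
    with opener(input_file, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last delimiter so the term itself may contain spaces.
            term, freq = line.rsplit(delimiter, 1)
            frequencies[term] = int(freq)
    return frequencies


# Write a small sample file, then load it back.
path = os.path.join(tempfile.mkdtemp(), "df.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("--NB_DOC--\t2\ngraph\t2\ngraph ranking\t1\n")

df = load_frequencies(path)
```

In this sketch the document-count row is kept as an ordinary entry under the key "--NB_DOC--", so a caller can read the collection size from the same dictionary.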
- pke.utils.load_lda_model(input_file)
Load a gzip file containing an LDA model.
- Parameters
input_file (str) – the gzip input file containing the LDA model.
- Returns
- a dictionary of the form {term_1: freq}, freq being an integer.
- model: an initialized sklearn.decomposition.LatentDirichletAllocation model.
- Return type
dictionary
- pke.utils.load_references(input_file, sep_doc_id=':', sep_ref_keyphrases=',', normalize_reference=False, language='en', encoding=None, excluded_file=None)
Load a reference file. The reference file can be either in JSON format or in the official SemEval-2010 format.
- Parameters
input_file (str) – path to the reference file.
sep_doc_id (str) – the separator used for doc_id in reference file, defaults to ‘:’.
sep_ref_keyphrases (str) – the separator used for keyphrases in reference file, defaults to ‘,’.
normalize_reference (bool) – whether to normalize the reference keyphrases using stemming, defaults to False.
language (str) – language of the input documents (used for computing the stems), defaults to ‘en’ (English).
encoding (str) – file encoding, defaults to None.
excluded_file (str) – file to exclude (for leave-one-out cross-validation), defaults to None.
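The role of the two separators can be sketched with a small standard-library parser. The line layout used here ("doc_id : keyphrase,keyphrase") is an assumption inferred from the default separators, the sample lines are invented, and `parse_references` is a hypothetical helper rather than part of pke; normalization, JSON input, and any further conventions of the SemEval-2010 format are left out.

```python
def parse_references(lines, sep_doc_id=":", sep_ref_keyphrases=","):
    """Map each document id to its list of reference keyphrases."""
    references = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split once, so keyphrases may themselves contain the id separator.
        doc_id, keyphrases = line.split(sep_doc_id, 1)
        references[doc_id.strip()] = [
            kp.strip()
            for kp in keyphrases.split(sep_ref_keyphrases)
            if kp.strip()
        ]
    return references


# Invented sample lines, assuming one document per line.
sample = [
    "doc-1 : keyphrase extraction,graph ranking",
    "doc-2 : topic model",
]
refs = parse_references(sample)
```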
- pke.utils.train_supervised_model(documents, references, model_file, language='en', stoplist=None, normalization='stemming', df=None, model=None, leave_one_out=False)
Build a supervised keyphrase extraction model from a set of documents and reference keywords.
- Parameters
documents (list) – list of (id, pke-readable document) tuples. The ids should match those in the references.
references (dict) – reference keywords.
model_file (str) – the model output file.
language (str) – language of the input documents (used for computing the n-stem or n-lemma forms), defaults to ‘en’ (English).
stoplist (list) – the stop words for filtering n-grams, defaults to pke.lang.stopwords[language].
normalization (str) – word normalization method, defaults to ‘stemming’. Other possible values are ‘lemmatization’ or ‘None’ for using word surface forms instead of stems/lemmas.
df (dict) – dictionary of document frequency (df) weights.
model (object) – the supervised model to train, defaults to Kea.
leave_one_out (bool) – whether to use a leave-one-out procedure for training, creating one model per input, defaults to False.
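The leave-one-out procedure mentioned for leave_one_out can be pictured as the generic data split below: each document is held out in turn and the remaining ones form the training set for its model. This shows only the general technique, not pke's code; `leave_one_out_splits` is a hypothetical helper, the documents are placeholder strings, and how pke stores the per-document models is not covered.

```python
def leave_one_out_splits(documents):
    """Yield (held_out_id, training_documents) for each input document."""
    for held_out_id, _ in documents:
        training = [
            (doc_id, doc) for doc_id, doc in documents if doc_id != held_out_id
        ]
        yield held_out_id, training


# Placeholder (id, document) tuples, mirroring the shape of `documents`.
documents = [("d1", "text one"), ("d2", "text two"), ("d3", "text three")]
splits = dict(leave_one_out_splits(documents))
```

With three input documents this produces three training sets of two documents each, one per model to be trained.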