Useful functions

Useful functions for the pke module.

pke.utils.compute_document_frequency(documents, output_file, language='en', stoplist=None, normalization='stemming', delimiter='\t', n=3)

Compute the n-gram document frequencies from a set of input documents. An extra row is added to the output file to record the number of documents from which the document frequencies were computed (--NB_DOC-- tab XXX). The output file is compressed using gzip.

Parameters
  • documents (list) – list of pke-readable documents.

  • output_file (str) – the output file.

  • language (str) – language of the input documents (used for computing the n-stem or n-lemma forms), defaults to ‘en’ (English).

  • stoplist (list) – the stop words for filtering n-grams, defaults to pke.lang.stopwords[language].

  • normalization (str) – word normalization method, defaults to ‘stemming’. The other possible value is ‘none’, which uses word surface forms instead of stems/lemmas.

  • delimiter (str) – the delimiter between n-grams and document frequencies, defaults to a tab character (‘\t’).

  • n (int) – the size of the n-grams, defaults to 3.
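
The output layout described above (one n-gram per row, an extra --NB_DOC-- row, gzip compression) can be illustrated with a small standard-library sketch. This is not pke's implementation: the helper name write_df_file is hypothetical, documents are assumed to be plain strings, tokenization is a simple whitespace split with no stemming or stoplist filtering, and counting every n-gram of 1 to n tokens is an assumption.

```python
import gzip
from collections import Counter


def write_df_file(documents, output_file, delimiter="\t", n=3):
    """Hypothetical helper: count in how many documents each n-gram occurs."""
    frequencies = Counter()
    for text in documents:
        tokens = text.lower().split()  # toy tokenization, no stemming
        ngrams = set()
        for size in range(1, n + 1):
            for start in range(len(tokens) - size + 1):
                ngrams.add(" ".join(tokens[start:start + size]))
        # A set is used so that each document counts once per n-gram.
        frequencies.update(ngrams)

    with gzip.open(output_file, "wt", encoding="utf-8") as f:
        # Extra row recording the number of documents.
        f.write("--NB_DOC--" + delimiter + str(len(documents)) + "\n")
        for ngram, freq in sorted(frequencies.items()):
            f.write(ngram + delimiter + str(freq) + "\n")
    return frequencies
```

A document frequency differs from a raw term count: an n-gram repeated ten times in one document still contributes 1.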

pke.utils.compute_lda_model(documents, output_file, n_topics=500, language='en', stoplist=None, normalization='stemming')

Compute an LDA model from a collection of documents. Latent Dirichlet Allocation is computed using the sklearn module.

Parameters
  • documents (list) – list of pke-readable documents.

  • output_file (str) – the output file.

  • n_topics (int) – number of topics for the LDA model, defaults to 500.

  • language (str) – language of the input documents, used for stop_words in sklearn CountVectorizer, defaults to ‘en’.

  • stoplist (list) – the stop words for filtering words, defaults to pke.lang.stopwords[language].

  • normalization (str) – word normalization method, defaults to ‘stemming’. The other possible value is ‘none’, which uses word surface forms instead of stems/lemmas.

pke.utils.load_document_frequency_file(input_file, delimiter='\t')

Load a TSV (tab-separated values) file containing document frequencies. Automatically detects whether the input file is compressed (gzip) by looking at its extension (.gz).

Parameters
  • input_file (str) – the input file containing document frequencies in delimiter-separated format (tab-separated by default).

  • delimiter (str) – the delimiter separating terms from their document frequencies, defaults to a tab character (‘\t’).

Returns

a dictionary of the form {term_1: freq}, freq being an integer.

Return type

dict
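
The loading behaviour described above (choosing gzip or plain text from the file extension, then building a term-to-frequency dictionary) can be sketched with the standard library. The function name read_df_file is hypothetical and this is only an illustration under those assumptions, not pke's code.

```python
import gzip


def read_df_file(input_file, delimiter="\t"):
    """Hypothetical reader: return {term: document frequency}."""
    # Choose the opener from the file extension (.gz means gzip-compressed).
    opener = gzip.open if input_file.endswith(".gz") else open
    frequencies = {}
    with opener(input_file, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty rows
            # Split on the last delimiter so the frequency is always the final field.
            term, freq = line.rsplit(delimiter, 1)
            frequencies[term] = int(freq)
    return frequencies
```

In this sketch the row holding the number of documents is loaded like any other row, so the count ends up under the --NB_DOC-- key.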

pke.utils.load_lda_model(input_file)

Load a gzip file containing an LDA model.

Parameters

input_file (str) – the gzip input file containing the LDA model.

Returns

dictionary: a dictionary of the form {term_1: freq}, freq being an integer.

model: an initialized sklearn.decomposition.LatentDirichletAllocation model.

Return type

dictionary

pke.utils.load_references(input_file, sep_doc_id=':', sep_ref_keyphrases=',', normalize_reference=False, language='en', encoding=None, excluded_file=None)

Load a reference file. The reference file can be either in JSON format or in the official SemEval-2010 format.

Parameters
  • input_file (str) – path to the reference file.

  • sep_doc_id (str) – the separator used for doc_id in reference file, defaults to ‘:’.

  • sep_ref_keyphrases (str) – the separator used for keyphrases in reference file, defaults to ‘,’.

  • normalize_reference (bool) – whether to normalize the reference keyphrases using stemming, defaults to False.

  • language (str) – language of the input documents (used for computing the stems), defaults to ‘en’ (English).

  • encoding (str) – file encoding, defaults to None.

  • excluded_file (str) – file to exclude (for leave-one-out cross-validation), defaults to None.
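
The role of the two separators can be shown with a minimal sketch. It assumes each line of a text reference file has the layout doc_id, then sep_doc_id, then the keyphrases joined by sep_ref_keyphrases. That layout is inferred from the parameter descriptions above, and the function name parse_reference_lines and the sample data are hypothetical. Normalization and JSON input are left out.

```python
from collections import defaultdict


def parse_reference_lines(lines, sep_doc_id=":", sep_ref_keyphrases=","):
    """Hypothetical parser for lines of the form 'doc_id : kp1,kp2,...'."""
    references = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Everything before the first separator is the document id.
        doc_id, _, keyphrases = line.partition(sep_doc_id)
        for keyphrase in keyphrases.split(sep_ref_keyphrases):
            keyphrase = keyphrase.strip()
            if keyphrase:
                references[doc_id.strip()].append(keyphrase)
    return dict(references)


# Toy input, not taken from any real dataset.
sample = ["doc-1 : resource management,real-time system", "doc-2 : auction"]
parsed = parse_reference_lines(sample)
```

Here parsed maps each document id to its list of reference keyphrases.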

pke.utils.train_supervised_model(documents, references, model_file, language='en', stoplist=None, normalization='stemming', df=None, model=None, leave_one_out=False)

Build a supervised keyphrase extraction model from a set of documents and reference keywords.

Parameters
  • documents (list) – list of (id, pke-readable document) tuples. The ids should match those in the references.

  • references (dict) – reference keywords.

  • model_file (str) – the model output file.

  • language (str) – language of the input documents (used for computing the n-stem or n-lemma forms), defaults to ‘en’ (English).

  • stoplist (list) – the stop words for filtering n-grams, defaults to pke.lang.stopwords[language].

  • normalization (str) – word normalization method, defaults to ‘stemming’. Other possible values are ‘lemmatization’ or ‘None’ for using word surface forms instead of stems/lemmas.

  • df (dict) – dictionary of document frequency (df) weights.

  • model (object) – the supervised model to train, defaults to Kea.

  • leave_one_out (bool) – whether to use a leave-one-out procedure for training, creating one model per input, defaults to False.
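
The leave-one-out procedure mentioned for leave_one_out can be pictured as follows: each document is held out in turn and a model is trained on all the others. The sketch below only produces the splits. The function name leave_one_out_splits is hypothetical, and how pke organizes the training itself is not shown here.

```python
def leave_one_out_splits(doc_ids):
    """Yield (held_out_id, training_ids) pairs, one per document."""
    doc_ids = list(doc_ids)
    for held_out in doc_ids:
        # Train on every document except the one being held out.
        training = [doc_id for doc_id in doc_ids if doc_id != held_out]
        yield held_out, training


splits = list(leave_one_out_splits(["d1", "d2", "d3"]))
```

With three documents this gives three splits, which matches the "one model per input" wording above.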