Training supervised models
pke ships with a collection of pretrained models (for supervised
keyphrase extraction approaches) and document frequency counts that were
computed on the training set of the SemEval-2010 benchmark dataset. These
resources are located in the pke/models/
directory.
Note that the pretrained models and document frequency (DF) counts are used by default if no parameters are given.
The following snippet of code illustrates how to train a new supervised model:
import pke
"""Train a Kea model given a collection of documents, a document frequency
(DF) counts file and a reference file (gold keyphrases).
"""
# load the DF counts from file
df_counts = pke.load_document_frequency_file(input_file='/path/to/df_counts')
# train a new Kea model
pke.train_supervised_model(input_dir='/path/to/collection/of/documents/',
                           reference_file='/path/to/reference/file',
                           model_file='/path/to/model/file',
                           df=df_counts,
                           extension='xml',
                           language='en',
                           normalization="stemming",
                           model=pke.supervised.Kea())
The training data consists of a set of documents along with a reference file containing annotated keyphrases in one of the following formats:
SemEval-2010 format, i.e.
FILENAME\s:\sKEYPHRASE_LIST
C-41 : hybrid system,quality of service+service quality, [...]
JSON format, i.e.
{
    "C-41": [
        [
            "hybrid system"
        ],
        [
            "quality of service",
            "service quality"
        ],
        [...]
    ]
}
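To make the relationship between the two reference formats above concrete, here is a minimal sketch that converts SemEval-2010 style lines into the JSON structure. It assumes, based only on the two examples shown, that commas separate keyphrases and that "+" separates variants of the same keyphrase. The helper names parse_semeval_line and semeval_to_json are hypothetical and are not part of pke.

```python
import json
import re


def parse_semeval_line(line):
    """Split one 'FILENAME : KEYPHRASE_LIST' line into (doc_id, keyphrases).

    Hypothetical helper, not part of pke. Assumes ',' separates keyphrases
    and '+' separates variants of the same keyphrase.
    """
    parts = re.split(r"\s:\s", line.strip(), maxsplit=1)
    if len(parts) != 2:
        raise ValueError("expected 'FILENAME : KEYPHRASE_LIST', got: %r" % line)
    doc_id, keyphrase_list = parts
    keyphrases = []
    for entry in keyphrase_list.split(","):
        # each entry becomes a list of its variants, empty pieces are dropped
        variants = [v.strip() for v in entry.split("+") if v.strip()]
        if variants:
            keyphrases.append(variants)
    return doc_id.strip(), keyphrases


def semeval_to_json(lines):
    """Build the JSON reference structure from SemEval-2010 style lines."""
    references = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        doc_id, keyphrases = parse_semeval_line(line)
        references[doc_id] = keyphrases
    return json.dumps(references, indent=4)


example = ["C-41 : hybrid system,quality of service+service quality"]
print(semeval_to_json(example))
```

Each document identifier maps to a list of keyphrases, and each keyphrase is itself a list so that equivalent variants stay grouped together.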