pke is an open source Python-based keyphrase extraction toolkit. It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new approaches. This toolkit also allows for easy benchmarking of state-of-the-art keyphrase extraction approaches, and ships with supervised models trained on the SemEval-2010 dataset.
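To illustrate what a pipeline with swappable components can look like, here is a minimal standard-library sketch. It is not pke's API: the function names (`select_candidates`, `weight_candidates`, `n_best`), the stopword list, and the frequency-times-length score are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from pke).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "is", "to", "with"}


def select_candidates(text, max_len=3):
    """Candidate selection: word n-grams (up to max_len) inside stopword-free runs.

    Punctuation is ignored, so runs may span sentence boundaries.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates, run = [], []
    for tok in tokens + [None]:  # None is a sentinel that flushes the last run
        if tok is not None and tok not in STOPWORDS:
            run.append(tok)
            continue
        for i in range(len(run)):
            for j in range(i + 1, min(i + max_len, len(run)) + 1):
                candidates.append(" ".join(run[i:j]))
        run = []
    return candidates


def weight_candidates(candidates):
    """Candidate weighting: frequency multiplied by phrase length in words."""
    counts = Counter(candidates)
    return {cand: n * len(cand.split()) for cand, n in counts.items()}


def n_best(weights, n=5):
    """Ranking: highest weight first, ties broken alphabetically."""
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


text = "Keyphrase extraction is the task of keyphrase extraction."
print(n_best(weight_candidates(select_candidates(text)), n=3))
```

Because each stage is a plain function, a different selection or weighting strategy can be substituted without touching the other stages.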
takahe is a multi-sentence compression module. Given a set of redundant sentences, a word graph is constructed by iteratively adding sentences to it. The best compression is obtained by finding the shortest path in the word graph. A keyphrase-based reranking method can be applied to generate more informative compressions.
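The word-graph idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not takahe's implementation or API: it assumes nodes are merged on lowercased word form alone, sets each edge cost to the inverse of how often the two words appear in sequence, and omits length constraints and keyphrase-based reranking. The names `build_word_graph` and `shortest_compression` are hypothetical.

```python
import heapq
from collections import defaultdict

START, END = "<s>", "</s>"


def build_word_graph(sentences):
    """Add sentences one by one, counting how often each word follows another."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = [START] + sentence.lower().split() + [END]
        for left, right in zip(tokens, tokens[1:]):
            counts[left][right] += 1
    return counts


def shortest_compression(sentences):
    """Return the words on the cheapest START-to-END path (Dijkstra's algorithm)."""
    graph = build_word_graph(sentences)
    queue = [(0.0, [START])]
    visited = set()
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == END:
            return " ".join(path[1:-1])
        if node in visited:
            continue
        visited.add(node)
        for nxt, count in graph[node].items():
            if nxt not in visited:
                # Frequent transitions are cheap, so shared wording is preferred.
                heapq.heappush(queue, (cost + 1.0 / count, path + [nxt]))
    return ""


sentences = [
    "obama visited paris on monday",
    "obama visited paris",
    "president obama visited france on monday",
]
print(shortest_compression(sentences))
```

Without a minimum-length constraint the shortest path tends to favor very short outputs, which is one reason a reranking step is useful for producing more informative compressions.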
A Large-Scale Dataset for Keyphrase Generation on News Documents (KPNews)
A large-scale dataset of 279,923 news texts paired with editor-curated keyphrases, for training and evaluating neural keyphrase generation models in the news domain.
Preprocessed SemEval-2010 benchmark dataset
The SemEval-2010 benchmark dataset for automatic keyphrase extraction, preprocessed at four increasingly sophisticated levels of linguistic processing.
Digital archive of French research articles in Natural Language Processing (TALN Archives)
TALN Archives is a digital archive of French research articles in Natural Language Processing. It contains the articles published at the TALN and RECITAL conferences from 1997 to 2015.
LINA Multi-sentence Compression dataset (LINA-msc)
LINA-msc is a dataset for evaluating Multi-sentence Compression in French. It is made of 40 sets of related sentences along with reference compressions composed by human assessors.
CLinical Information Retrieval Evaluation Collection (CLIREC)
CLIREC is a test collection for clinical IR. From a set of systematic reviews, we have generated 423 queries with relevance data. Relevance judgments were manually collected from the References section of each review, which lists the citations from which the review's synthesized results were extracted.