Information retrieval in scientific digital libraries is a time-consuming and tedious task because scientific articles are often incompletely indexed. Although the uptake of open access to scientific publications has accelerated full-text indexing, it has not yet led to the expected improvement. Rather, exploiting full-text articles requires complex and error-prone Natural Language Processing techniques, which may degrade indexing. In previous work, these techniques are often left unstated, and their impact on retrieval effectiveness remains unclear. The purpose of the TALIAS project is to re-assess and compare state-of-the-art keyphrase extraction models at increasingly sophisticated levels of document preprocessing. In doing so, we determine to what extent performance variation across keyphrase extraction systems is a function of the effectiveness of document preprocessing, and we study the robustness of these systems to noisy text.
We showed that performance variation across keyphrase extraction systems is, at least in part, a function of the (often unstated) effectiveness of document preprocessing.
We empirically showed that supervised models are more resilient to noise, and pointed out that the performance gap between baselines and top-performing systems narrows as preprocessing effort increases.
We compared the previously reported results of several keyphrase extraction models with those of our re-implementation, and observed that baseline performance is underestimated because of inconsistency in document preprocessing.
We released both a new version of the SemEval-2010 dataset with preprocessed documents and our implementation of the state-of-the-art keyphrase extraction models, built with the pke toolkit, for use by the community.
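To make the link between preprocessing and measured performance concrete, here is a minimal, self-contained Python sketch. It is not the project's code and does not use pke: the frequency-based ranker, the `clean` step, the stopword list and the two-line toy document are all assumptions chosen for illustration. It scores the same toy baseline on raw and on lightly cleaned text using exact-match F1.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the one used in the project).
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is",
             "of", "on", "the", "to", "we", "with"}


def clean(text):
    """Toy preprocessing: rejoin words hyphenated across line breaks and
    drop lines that contain only digits (e.g. stray page numbers).
    Note: this can wrongly merge genuinely hyphenated compounds."""
    text = re.sub(r"-\n(?=[a-z])", "", text)
    lines = [line for line in text.split("\n") if not line.strip().isdigit()]
    return "\n".join(lines)


def candidates(text, max_len=3):
    """Yield word n-grams (1 to max_len words) that contain no stopword."""
    for segment in re.split(r"[.!?;:,\n]+", text.lower()):
        words = re.findall(r"[a-z]+(?:-[a-z]+)*", segment)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if not any(w in STOPWORDS for w in gram):
                    yield " ".join(gram)


def top_keyphrases(text, n=5):
    """Rank candidates by frequency; ties go to longer phrases, then A-Z."""
    counts = Counter(candidates(text))
    ranked = sorted(counts, key=lambda p: (-counts[p], -len(p.split()), p))
    return ranked[:n]


def prf(predicted, gold):
    """Exact-match precision, recall and F1 between two sets of phrases."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Toy document with typical extraction noise: a hyphenated line break
# and a stray page number.
raw = ("Keyphrase extrac-\ntion models\n3\n"
       "Keyphrase extraction is sensitive to noise.\n")
gold = {"keyphrase extraction"}

for label, text in (("raw", raw), ("cleaned", clean(raw))):
    predicted = top_keyphrases(text, n=2)
    print(label, predicted, prf(predicted, gold))
```

Even on this toy input, the hyphenated line break splits a gold keyphrase in the raw text, so the same ranker scores differently once the text is cleaned. Real pipelines differ in many more ways (tokenization, sentence splitting, document structure detection), which is why the preprocessing used should be reported alongside results.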
- Florian Boudin - Principal investigator
- Béatrice Daille
- Nicolas Hernandez
- Adrien Bougouin
- Hugo Mougard
- Damien Cram
How Document Pre-processing affects Keyphrase Extraction Performance.
[arXiv, bib, code, dataset]
Florian Boudin, Hugo Mougard and Damien Cram.
COLING 2016 Workshop on Noisy User-generated Text (WNUT).
pke: an open source python-based keyphrase extraction toolkit.
International Conference on Computational Linguistics (COLING), demonstration papers, 2016.