Abstract

Information retrieval in scientific digital libraries is a time-consuming and tedious task, because scientific articles are often incompletely indexed. Full-text indexing, though accelerated by the uptake of open access to scientific publications, has not yet led to the expected improvement. Rather, exploiting full-text articles requires complex and error-prone Natural Language Processing techniques, which may degrade indexing. In previous work, these techniques are often left unstated and their impact on retrieval effectiveness remains unclear. The purpose of the TALIAS project is to re-assess and compare state-of-the-art keyphrase extraction models at increasingly sophisticated levels of document preprocessing. In doing so, we determine to what extent performance variation across keyphrase extraction systems is a function of the effectiveness of document preprocessing, and study the robustness of these systems to noisy text.

Results

Participants

Publications