Bibliographic Metadata Extraction

SciPlore’s technologies, Citation-based Plagiarism Detection and Citation Proximity Analysis, depend on correct bibliographic metadata, such as author and title information, references, and citations. We developed several tools that extract the required information from PDF files.

Header Data Extraction Framework

To obtain general article metadata, such as title [1], authors, affiliations, journal, and DOI, from PDF documents, we reviewed the tools available for this task and found that each has its own strengths and weaknesses. Instead of relying on a single tool, we developed a framework that combines several metadata extraction tools.

The framework accepts PDF documents as input and returns the extracted metadata in the form of a unified data structure. Because each extraction tool is executed through its own module of the framework, individual tools can be replaced or substituted easily. We are currently using the framework to build a hybrid approach that combines the best results yielded by the different extraction tools.
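The modular design described above can be illustrated with a minimal Python sketch. It is not the framework's actual code: the names `Metadata`, `ExtractionFramework`, `register`, and `extract`, as well as the per-field priority list used for merging, are assumptions made for illustration. Field-wise priority is only one possible way to combine the results of several tools.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Metadata:
    """Unified record returned for one PDF (field names are hypothetical)."""
    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    doi: Optional[str] = None


# A module wraps one external tool: it takes a PDF path and returns Metadata.
Extractor = Callable[[str], Metadata]


class ExtractionFramework:
    """Runs interchangeable extraction modules and merges their output."""

    def __init__(self) -> None:
        self._modules: Dict[str, Extractor] = {}

    def register(self, name: str, extractor: Extractor) -> None:
        # Registering under an existing name substitutes the tool.
        self._modules[name] = extractor

    def extract_all(self, pdf_path: str) -> Dict[str, Metadata]:
        # Run every registered module on the same document.
        return {name: run(pdf_path) for name, run in self._modules.items()}

    def extract(self, pdf_path: str, priority: Dict[str, List[str]]) -> Metadata:
        # Assumed hybrid strategy: for each field, take the first
        # non-empty value from the tools in the given priority order.
        results = self.extract_all(pdf_path)
        merged = Metadata()
        for field_name, tool_order in priority.items():
            for tool in tool_order:
                if tool not in results:
                    continue
                value = getattr(results[tool], field_name)
                if value:
                    setattr(merged, field_name, value)
                    break
        return merged
```

In this sketch, swapping one tool for another only requires registering a different callable under the same name; the calling code and the returned data structure stay unchanged.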

Advanced Automated Citation Extraction

Citation-based Plagiarism Detection and Citation Proximity Analysis require accurate information about the position of citations within the full text. Our review of available citation extraction tools showed that none of them supports sophisticated position analysis.

We therefore decided to enhance existing open source tools with methods for identifying the position of citations at the character, sentence, and section level of a text. We developed an enhanced version of the open source tool ParsCit, since it yielded very good parsing results. In the future, we intend to extend more tools in a similar way.
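To illustrate what position information at these three levels can look like, here is a minimal Python sketch. It is not ParsCit's method and not our enhanced implementation: it assumes numeric citation markers such as "[1]", uses a naive sentence splitter, and expects the text to be already divided into sections. The names `CitationPosition` and `locate_citations` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class CitationPosition:
    marker: str       # citation marker as it appears in the text, e.g. "[1]"
    char_offset: int  # offset in the full text (sections joined by "\n")
    sentence: int     # 0-based sentence index within the section
    section: int      # 0-based section index


def locate_citations(sections: List[str]) -> List[CitationPosition]:
    """Find numeric citation markers and report their positions."""
    positions: List[CitationPosition] = []
    section_start = 0  # offset of the current section in the joined text
    for sec_idx, text in enumerate(sections):
        # Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
        boundaries = [m.end() for m in re.finditer(r"[.!?]\s+", text)]
        for m in re.finditer(r"\[\d+\]", text):
            # Sentence index = number of sentence boundaries before the marker.
            sentence = sum(1 for b in boundaries if b <= m.start())
            positions.append(
                CitationPosition(m.group(), section_start + m.start(), sentence, sec_idx)
            )
        section_start += len(text) + 1  # +1 for the joining newline
    return positions
```

With positions recorded this way, two citations can be compared by whether they share a sentence, a section, or only the document, which is the kind of distinction that proximity-based analysis relies on.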

Related Publications

[1] Joeran Beel, Bela Gipp, Ammar Shaker, and Nick Friedrich. SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size). In M. Lalmas, J. Jose, A. Rauber, F. Sebastiani, and I. Frommholz, editors, Research and Advanced Technology for Digital Libraries, Proceedings of the 14th European Conference on Digital Libraries (ECDL’10), volume 6273 of Lecture Notes in Computer Science (LNCS), pages 413-416, Glasgow (UK), 2010. Springer. Available at http://gipp.com/pub
[Bibtex]
@INPROCEEDINGS{Beel10e,
author = {Joeran Beel and Bela Gipp and Ammar Shaker and Nick Friedrich},
title = {{S}ci{P}lore {X}tract: {E}xtracting {T}itles from {S}cientific {PDF} {D}ocuments by {A}nalyzing {S}tyle {I}nformation ({F}ont {S}ize)},
booktitle = {Research and Advanced Technology for Digital Libraries, Proceedings of the 14th European Conference on Digital Libraries (ECDL'10)},
year = {2010},
editor = {M. Lalmas and J. Jose and A. Rauber and F. Sebastiani and I. Frommholz},
volume = {6273},
series = {Lecture Notes in Computer Science (LNCS)},
pages = {413-416},
address = {Glasgow (UK)},
month = sep,
publisher = {Springer},
note = {Available at http://gipp.com/pub}
}