Our projects, such as Citation-based Plagiarism Detection, Co-Citation Proximity Analysis and Mr. DLib, depend on the availability of bibliographic metadata. Author names, title information, references and citations must be accessible and ideally error-free. We improved existing tools and developed our own to extract the required information from PDF files.
Header Data Extraction Framework
To obtain general article metadata, such as title, authors, affiliations, journal and DOI, from PDF documents, we reviewed the most promising tools available for this task and found that each tool has its own strengths and weaknesses.
Instead of relying on a single tool for the entire task, we developed a framework that selects the best metadata extraction tool for each individual task and combines the results.
The framework takes PDF documents as input and returns the extracted metadata as a unified data structure. Because each tool is executed through its own module of the framework, individual tools can be changed or substituted easily. We are currently using the framework to construct a hybrid approach that combines the best results yielded by the different extraction tools.
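The modular design described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the framework's actual code: the names `Metadata`, `ExtractionFramework`, `register` and `prefer` are hypothetical, and the two extractors are stand-in stubs rather than wrappers around real tools. Each module maps a PDF path to a partial dictionary of fields, and the framework merges the results field by field according to a configurable preference order.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Fields named in the text: title, authors, affiliations, journal and DOI.
FIELDS = ("title", "authors", "affiliations", "journal", "doi")


@dataclass
class Metadata:
    """Hypothetical unified data structure returned by the framework."""
    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    affiliations: List[str] = field(default_factory=list)
    journal: Optional[str] = None
    doi: Optional[str] = None


# A module wraps one extraction tool: PDF path in, partial field dict out.
Extractor = Callable[[str], Dict[str, object]]


class ExtractionFramework:
    """Hypothetical sketch: runs every module, then merges field by field."""

    def __init__(self) -> None:
        self._modules: Dict[str, Extractor] = {}
        self._preferred: Dict[str, List[str]] = {}

    def register(self, name: str, extractor: Extractor) -> None:
        # Substituting a tool means registering a different callable here.
        self._modules[name] = extractor

    def prefer(self, field_name: str, module_names: List[str]) -> None:
        # Order in which modules are trusted for one field.
        self._preferred[field_name] = list(module_names)

    def extract(self, pdf_path: str) -> Metadata:
        results = {name: run(pdf_path) for name, run in self._modules.items()}
        merged = Metadata()
        for f in FIELDS:
            # Fall back to registration order when no preference is set.
            order = self._preferred.get(f, list(self._modules))
            for name in order:
                value = results.get(name, {}).get(f)
                if value:  # take the first non-empty value
                    setattr(merged, f, value)
                    break
        return merged


# Stand-in stubs (assumptions): real modules would call an actual tool.
def tool_a(pdf_path: str) -> Dict[str, object]:
    return {"title": "An Example Title", "authors": ["A. Author"], "doi": None}


def tool_b(pdf_path: str) -> Dict[str, object]:
    return {"title": "an example title", "doi": "10.1000/example",
            "journal": "Example Journal"}


framework = ExtractionFramework()
framework.register("tool_a", tool_a)
framework.register("tool_b", tool_b)
framework.prefer("title", ["tool_a", "tool_b"])
framework.prefer("doi", ["tool_b", "tool_a"])
result = framework.extract("paper.pdf")
print(result)
```

Taking the first non-empty value per field is the simplest possible merge rule. A hybrid approach could replace it with voting or confidence scores without changing the module interface.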
Advanced Automated Citation Extraction
Accurate information on citation position (the location of each citation in the full text) is required to perform Citation-based Plagiarism Detection and Co-Citation Proximity Analysis. In our review of available citation extraction tools, we found that none of them supports a sophisticated position analysis.
We chose to enhance existing open-source tools with methods that identify the position of citations at the character, sentence and section level of the text. We developed an enhanced version of the open-source tool ParsCit, since it yielded very good parsing results. In the future, we intend to improve more tools in a similar manner.
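To make the three levels of position concrete, the following is a minimal sketch of how a citation's character, sentence and section position could be recorded. It is not ParsCit's method or the enhanced version described above. It assumes plain text that has already been extracted, bracketed numeric markers such as `[1]`, a caller-supplied list of section headings, and a naive sentence splitter. The names `CitationPosition` and `locate_citations` are hypothetical.

```python
import re
from bisect import bisect_right
from dataclasses import dataclass
from typing import List


@dataclass
class CitationPosition:
    marker: str          # citation marker as it appears in the text, e.g. "[2]"
    char_offset: int     # character level: offset from the start of the text
    sentence_index: int  # sentence level: 0-based, counted over the whole text
    section_index: int   # section level: 0-based, -1 if before the first heading


# Assumption: citations are bracketed numbers; other styles need other patterns.
CITATION_RE = re.compile(r"\[\d+\]")
# Naive sentence boundary: terminator followed by whitespace or end of text.
SENTENCE_END_RE = re.compile(r"[.!?](?:\s+|$)")


def locate_citations(text: str, section_headings: List[str]) -> List[CitationPosition]:
    # Offsets at which each known heading starts, searched in document order.
    section_starts: List[int] = []
    search_from = 0
    for heading in section_headings:
        pos = text.find(heading, search_from)
        if pos >= 0:
            section_starts.append(pos)
            search_from = pos + len(heading)

    # Offsets just past each sentence boundary (where the next sentence begins).
    sentence_ends = [m.end() for m in SENTENCE_END_RE.finditer(text)]

    positions = []
    for match in CITATION_RE.finditer(text):
        offset = match.start()
        positions.append(CitationPosition(
            marker=match.group(),
            char_offset=offset,
            # Number of sentence boundaries at or before the citation.
            sentence_index=bisect_right(sentence_ends, offset),
            # Last heading that starts at or before the citation.
            section_index=bisect_right(section_starts, offset) - 1,
        ))
    return positions


sample = (
    "Introduction\n"
    "Prior work [1] showed this. Later work [2] and [3] disagreed.\n"
    "Methods\n"
    "We follow [2]."
)
citations = locate_citations(sample, ["Introduction", "Methods"])
for c in citations:
    print(c)
```

With positions recorded this way, two citations can be compared by whether they share a sentence or a section, or by their character distance. Abbreviations, decimal numbers and headings that reappear in the body text would defeat this simple splitter, so a real implementation needs more robust segmentation.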