Information Science
Prof. Dr. Bela Gipp

CitePlag - Citation-based Plagiarism Detection

CitePlag is the first plagiarism detection system to implement Citation-based Plagiarism Detection (CbPD), a novel approach capable of detecting even heavily disguised plagiarism in academic texts. The system is available as open source.

Existing software examines only literal text similarity to detect plagiarism and thus typically fails to detect disguised forms of plagiarism, such as paraphrases, translations, or idea plagiarism. CbPD addresses this shortcoming by additionally analyzing the placement of citations in the full text of documents to form a language-independent semantic “fingerprint” of document similarity.

CitePlag implements several citation-based algorithms to analyze the citation patterns of publications. The screenshot shows two publications visualized in the CitePlag prototype. Matching citations are highlighted and connected in a central column for quick document examination. The documents share no literal text similarity: the left publication is in English and the right in Chinese. However, one can see that the overlap of citations is high, and the order in which sources are cited is nearly identical in several paragraphs.
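
One way to quantify a shared citation order like the one described above is the longest common subsequence of the two documents' citation sequences. The following is a minimal sketch under that assumption, not CitePlag's actual implementation; the function name and the source identifiers are hypothetical.

```python
from typing import List, Sequence


def longest_common_citation_sequence(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Return one longest sequence of sources cited in the same order in both documents."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the length of the longest common subsequence of a[i:] and b[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to recover one longest sequence.
    result: List[str] = []
    i = j = 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical citation sequences, in the order the sources are cited in each text.
english_doc = ["smith2001", "lee2005", "garcia1999", "chen2010", "kim2003"]
chinese_doc = ["smith2001", "garcia1999", "wang2008", "chen2010", "kim2003"]

shared = longest_common_citation_sequence(english_doc, chinese_doc)
print(shared)
```

Because the comparison uses only source identifiers, it does not depend on the language of the surrounding text.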


OriginStamp - Trusted Time Stamping via Bitcoin

OriginStamp is a web-based, trusted timestamping service that uses the decentralized Bitcoin block chain to store anonymous, tamper-proof time stamps for any digital content. OriginStamp allows users to hash files, emails, or plain text, and subsequently store the created hashes in the Bitcoin block chain as well as retrieve and verify time stamps that have been committed to the block chain. OriginStamp is free of charge and easy to use and thus allows anyone, e.g., students, researchers, authors, journalists, or artists, to prove that they were the originator of certain information at a given point in time. The procedures maintain complete privacy of your data. Common use cases of OriginStamp include proving that:

  • a contract was signed or a task was completed prior to a certain date.
  • a photo or video was recorded prior to a certain date.
  • an idea for a patent already existed prior to a certain date, e.g., prior to signing an NDA.

The idea of timestamping is not new. Even before computers existed, information could be encoded and the code could be published, for example, in a newspaper. We apply the same idea but use the block chain of the cryptocurrency Bitcoin as a decentralized, tamper-proof, and cost-efficient timestamping authority.
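
The hashing and verification steps can be illustrated locally. This is a minimal sketch that assumes SHA-256 as the hash function (the text above does not name one); the function names are hypothetical, and submitting the hash to the block chain is not shown.

```python
import hashlib


def content_fingerprint(content: bytes) -> str:
    """Return the hex digest that would be timestamped instead of the content itself."""
    # Assumption: SHA-256. Only this digest, never the content, needs to be shared.
    return hashlib.sha256(content).hexdigest()


def matches_timestamped_hash(content: bytes, recorded_hash: str) -> bool:
    """Check that the content still produces the hash recorded at timestamping time."""
    return content_fingerprint(content) == recorded_hash.lower()


# Hypothetical document text used only for illustration.
draft = b"Idea: a citation-based approach to plagiarism detection."
recorded = content_fingerprint(draft)

print(matches_timestamped_hash(draft, recorded))         # unchanged content
print(matches_timestamped_hash(draft + b" ", recorded))  # any change alters the hash
```

Since only the digest is stored, anyone holding the original content can later recompute the hash and compare it with the timestamped value.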

To see the OriginStamp project for yourself, please visit:

Docear - Academic Literature Management via Mind Maps

Docear is open source software for literature management that is tailored to the needs of students, researchers, and academics. Docear bundles multiple tools for academic literature and knowledge management into a single interface using mind maps.

Watch the video below or visit the website for details on this ongoing project.

Docear video available at:

CITREC - Open Evaluation Framework for Citation-based Similarity Measures

CITREC is an open evaluation framework for citation-based and text-based similarity measures. CITREC prepares the data of two formerly separate collections for a citation-based analysis and provides the tools necessary for performing evaluations of similarity measures. The first collection is the PubMed Central Open Access Subset (PMC OAS), the second is the collection used for the Genomics Tracks at the Text REtrieval Conferences (TREC) ’06 and ’07 (overview paper for the 2006 TREC Gen collection).

CITREC extends the PMC OAS and TREC Genomics collections by providing:

  1. citation and reference information that includes the position of in-text citations for documents in both collections;
  2. code and pre-computed scores for 35 citation-based and text-based similarity measures;
  3. two gold standards based on Medical Subject Headings (MeSH) descriptors and the relevance feedback gathered for the TREC Genomics collection;
  4. a web-based system (Literature Recommendation Evaluator – LRE) that allows evaluating similarity measures on their ability to identify documents that are relevant to user-defined information needs;
  5. tools to statistically analyze and compare the scores that individual similarity measures yield.
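
To make the notion of a citation-based similarity measure concrete, here is a minimal sketch of bibliographic coupling, which scores two documents by the number of references they share. It is used only as a familiar example; the text above does not list the 35 measures, and the function names and document identifiers below are hypothetical rather than part of CITREC.

```python
from typing import Dict, List, Set, Tuple


def bibliographic_coupling(refs_a: Set[str], refs_b: Set[str]) -> int:
    """Count the references that two documents have in common."""
    return len(refs_a & refs_b)


def rank_by_coupling(query_id: str, references: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Rank all other documents by their coupling strength with the query document."""
    query_refs = references[query_id]
    scores = [
        (doc_id, bibliographic_coupling(query_refs, refs))
        for doc_id, refs in references.items()
        if doc_id != query_id
    ]
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical reference lists keyed by document id.
references = {
    "doc1": {"r1", "r2", "r3"},
    "doc2": {"r2", "r3", "r4"},
    "doc3": {"r5"},
}

ranking = rank_by_coupling("doc1", references)
print(ranking)
```

A ranking like this is the kind of output that can be compared against a gold standard to judge how well a measure identifies related documents.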

news-please - an integrated web crawler and information extractor for news

news-please is an open source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both the most recent and older, archived articles. You only need to provide the root URL of the news website. news-please also features a library mode, which allows developers to use the crawling and extraction functionality within their own programs.

The core functionalities include:

  • full website crawling (users only need to provide the root URL)
  • crawling of recent (using RSS) and old articles (using sitemaps and recursive link analysis)
  • information extraction with a precision of 0.7
  • two modes of operation: as a CLI tool or via an API called from your own code
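
The RSS-based discovery of recent articles can be illustrated with the Python standard library alone. This is a minimal sketch of the general technique, not news-please's own code or API; the function name, the sample feed, and the example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def article_links_from_rss(rss_xml: str) -> List[str]:
    """Collect the article URLs listed in the <item> elements of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    links: List[str] = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


# Hypothetical feed content; a crawler would download this from the news site.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <item>
      <title>First story</title>
      <link>https://example.com/news/first-story</link>
    </item>
    <item>
      <title>Second story</title>
      <link> https://example.com/news/second-story </link>
    </item>
  </channel>
</rss>"""

recent_articles = article_links_from_rss(SAMPLE_FEED)
print(recent_articles)
```

A feed only lists recent items, which is why older articles have to be found through sitemaps and recursive link analysis instead.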


Co-Citation Proximity Analysis - Recommendation and Clustering Algorithms for Academic Literature

Co-Citation Proximity Analysis (CPA) is a method to compute both local and global instances of semantic similarity in academic documents by examining citation proximity in the full texts of documents.

CPA was developed with two applications in mind: recommender systems and clustering.
Regarding the first application, an improved measure of document semantic similarity, which computes similarity at a more fine-grained resolution, has the potential to significantly improve the relevance of academic literature recommendations. Regarding the second application, a more granular measure of document similarity allows the development of more precise clustering algorithms for academic literature.
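
The core idea, that sources cited closer together are likely more strongly related, can be sketched for a single citing document as follows. The three weights and all names are illustrative assumptions and are not the values or code used by CPA.

```python
from itertools import combinations
from typing import Dict, List, NamedTuple, Tuple


class Citation(NamedTuple):
    source: str     # identifier of the cited work
    paragraph: int  # index of the paragraph containing the citation
    sentence: int   # document-wide index of the sentence containing the citation


# Assumed, illustrative weights: closer co-citations count more.
SAME_SENTENCE = 1.0
SAME_PARAGRAPH = 0.5
SAME_DOCUMENT = 0.25


def proximity_weight(a: Citation, b: Citation) -> float:
    """Weight a pair of citations by the smallest text unit that contains both."""
    if a.sentence == b.sentence:
        return SAME_SENTENCE
    if a.paragraph == b.paragraph:
        return SAME_PARAGRAPH
    return SAME_DOCUMENT


def co_citation_proximity(citations: List[Citation]) -> Dict[Tuple[str, str], float]:
    """Return, for each pair of distinct sources, the strongest proximity found."""
    scores: Dict[Tuple[str, str], float] = {}
    for a, b in combinations(citations, 2):
        if a.source == b.source:
            continue
        pair = (min(a.source, b.source), max(a.source, b.source))
        scores[pair] = max(scores.get(pair, 0.0), proximity_weight(a, b))
    return scores


# Hypothetical citations extracted from the full text of one citing document.
citations = [
    Citation("A", paragraph=1, sentence=1),
    Citation("B", paragraph=1, sentence=1),
    Citation("C", paragraph=1, sentence=2),
    Citation("D", paragraph=3, sentence=7),
]

scores = co_citation_proximity(citations)
print(scores)
```

Summing such per-document scores over many citing documents would give a pairwise similarity usable for recommendation or clustering.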


Mr. DLib - Machine-readable Digital Library

Mr. DLib's "Recommendations as a Service" (RaaS) allows operators of academic products to easily integrate a scientific recommender system into their products. The basic idea of Mr. DLib's scientific recommender system is to calculate recommendations for research articles, calls for papers, grants, etc. on Mr. DLib's server. Operators of academic products may then request recommendations from Mr. DLib and display them to their users.

This service:

  • helps academic service providers enhance their own portfolio, e.g., by providing more precise literature recommendations
  • supports researchers in need of (large amounts of) bibliographic data or scholarly full texts, e.g., to perform impact or trend analyses
  • provides a base for other agents to build their own services upon the data of Mr. DLib.

Bibliographic Metadata Extraction

For our projects, such as Citation-based Plagiarism Detection, Co-Citation Proximity Analysis, or Mr. DLib, we depend on the availability of bibliographic metadata. Author names, title information, references, and citations must be accessible and ideally error-free. We improved existing tools and developed our own tools to extract all the required information from PDF files.


Past Projects