CITREC is an open evaluation framework for citation-based and text-based similarity measures.
CITREC Overview Paper:
B. Gipp, N. Meuschke, and M. Lipinski, “CITREC: An Evaluation Framework for Citation-Based Similarity Measures based on TREC Genomics and PubMed Central,” in Proceedings of the iConference 2015, Newport Beach, California, 2015. (PDF)
CITREC prepares the data of two formerly separate collections for a citation-based analysis and provides the tools necessary for performing evaluations of similarity measures. The first collection is the PubMed Central Open Access Subset (PMC OAS), the second is the collection used for the Genomics Tracks at the Text REtrieval Conferences (TREC) ’06 and ’07 (overview paper for the TREC Gen collection).
CITREC extends the PMC OAS and TREC Genomics collections by providing:
- citation and reference information that includes the position of in-text citations for documents in both collections;
- code and pre-computed scores for 35 citation-based and text-based similarity measures;
- two gold standards based on Medical Subject Headings (MeSH) descriptors and the relevance feedback gathered for the TREC Genomics collection;
- a web-based system (Literature Recommendation Evaluator – LRE) that allows evaluating similarity measures on their ability to identify documents that are relevant to user-defined information needs;
- tools to statistically analyze and compare the scores that individual similarity measures yield.
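To make the notion of a citation-based similarity measure concrete, here is a minimal sketch of Co-Citation counting: two documents are considered more similar the more often later documents cite both of them. This is an illustration only, not the CITREC implementation. The function name `co_citation_counts` and the toy data are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def co_citation_counts(references):
    """Count how many citing documents cite both members of each pair.

    `references` maps a citing document ID to the IDs it cites.
    Returns a Counter keyed by alphabetically ordered (doc_a, doc_b) pairs.
    """
    counts = Counter()
    for cited in references.values():
        # Deduplicate so that repeated in-text citations of the same
        # document within one paper count only once for that paper.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy data, not taken from PMC OAS or TREC Genomics.
toy_references = {
    "P1": ["A", "B", "C"],
    "P2": ["A", "B"],
    "P3": ["B", "C", "B"],
}
scores = co_citation_counts(toy_references)
print(scores[("A", "B")])  # number of papers citing both A and B
```

This sketch ignores where in the text the citations occur. Measures that use the positions of in-text citations would need that extra information as input.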
The demo database (User: citrec_demo / Password: citrec) gives a first impression of the data that CITREC offers and the kinds of analysis the framework supports.
This Excel spreadsheet exemplifies a possible evaluation using CITREC data. The spreadsheet compares the scores calculated using different similarity measures depending on the maximum Co-Citation score.
- Database Overview and Tutorial explaining the structure of the CITREC database and demonstrating the usage of the demo system.
- Overview of Similarity Tables listing the similarity measures included in the CITREC framework and explaining the naming conventions for the database tables that contain the similarity scores calculated using the individual measures.
- Parser Documentation explaining the procedures for data extraction and cleaning.
- LRE Documentation describing the web-based SciPlore Literature Recommendation Evaluator, which supports running surveys to gather relevance feedback and establish gold standard datasets.
PubMed Central Open Access Subset
- Database Schema only (1.3 KB)
- Whole Database (5 GB zipped, ~20 GB raw) – includes document metadata, citation data and pre-computed similarity scores
TREC Genomics collection
- Database Schema only (1.2 KB)
- Whole Database (1 GB zipped, ~5 GB raw) – includes document metadata and citation data.
git repository (Bitbucket)
The Java source code includes:
- parsers for the PMC OAS and the TREC Genomics collection as well as tools to retrieve MeSH and article metadata from NCBI resources (package org.sciplore.citrec.dataimport)
- tools to statistically evaluate retrieval results using a top-k or a rank-based analysis (package org.sciplore.citrec.eval)
- implementations of similarity measures and code to calculate the MeSH-based gold standard (package org.sciplore.citrec.sim)
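As an illustration of what a top-k analysis can look like, the sketch below computes precision at k: the fraction of the k highest-ranked documents returned by a similarity measure that the gold standard marks as relevant. This is one common reading of "top-k" and is an assumption. The evaluation code in org.sciplore.citrec.eval may define the analysis differently. The function name `precision_at_k` and the toy data are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant.

    `ranked_ids` is a list ordered from most to least similar.
    `relevant_ids` is the set of documents the gold standard marks relevant.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not by len(top_k)) penalizes rankings shorter than k.
    return hits / k


# Hypothetical ranking from a similarity measure and a toy gold standard.
ranking = ["D3", "D1", "D7", "D2", "D9"]
gold_standard = {"D1", "D2", "D5"}

p_at_3 = precision_at_k(ranking, gold_standard, 3)
p_at_5 = precision_at_k(ranking, gold_standard, 5)
print(p_at_3, p_at_5)
```

Computing the same value for several measures at a fixed k gives a simple basis for comparing them against one gold standard.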
The source code for the Literature Recommendation Evaluator (LRE) uses the Symfony (v. 2) PHP framework.
git repository (Bitbucket)
CITREC is an open source project published under the GNU General Public License (GPL) version 2. We warmly invite you to contribute to the continuous development of the framework by sharing results and resources related to CITREC.
If you have performed an evaluation using CITREC, developed a similarity measure, a parser, or any other tool that you would like to share, we would be happy to acknowledge and share your work on this page. If you are interested in making your resources available through this page, please contact us at firstname.lastname@example.org.
Document Collections and Metadata
Below, we link to the sources of full texts and metadata that we combined, processed and enhanced as part of the CITREC framework. Please observe the individual licenses of the publishers!
- PubMed Central Open Access Subset (PMC OAS) data
- TREC 2006 Genomics data (requires registration)
- Medical Subject Headings
- ParsCit – Citation Parser
- SimPack – Java Library for Similarity Measures
- WebLA – Java package for handling Web Graphs that implements popular algorithms such as PageRank, HITS, CoCitation Similarity and SimRank.
We thank everyone who contributed to the creation of the TREC Genomics test collection. Without their work, the CITREC framework would not have been possible.
If you experience any problems or would like to contribute to this project, please send us an email: email@example.com