news-please is an open-source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both the most recent and older, archived articles. You only need to provide the root URL of the news website. news-please also features a library mode, which lets developers use the crawling and extraction functionality within their own programs (Python 2.7 and 3.x).

The core functionalities include:

  • full website crawling (users only need to provide the root URL)
  • crawling of recent (using RSS) and old articles (using sitemaps and recursive link analysis)
  • information extraction with a precision of 0.7
  • two modes of use: a CLI, or an API called from your own code (as a Python module)
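
As a rough illustration of the library mode, the sketch below wraps a single-article extraction in a small helper. It assumes the package exposes `NewsPlease.from_url` and that the returned article object carries `title` and `maintext` attributes; check these against the version you have installed. The helper names `fetch_article` and `summarize` and the example URL are hypothetical and not part of news-please.

```python
from types import SimpleNamespace


def fetch_article(url):
    """Download and extract one article via the library mode (assumed API)."""
    # Imported lazily so the rest of this sketch runs without the package installed.
    from newsplease import NewsPlease
    return NewsPlease.from_url(url)


def summarize(article, max_chars=80):
    """Build a one-line summary from an extracted article object.

    Only the `title` and `maintext` attributes are read; either may be
    None if extraction found nothing for that field.
    """
    title = article.title or "(no title)"
    text = (article.maintext or "").strip()
    snippet = text[:max_chars] + ("..." if len(text) > max_chars else "")
    return "{}: {}".format(title, snippet) if snippet else title


if __name__ == "__main__":
    # Stand-in object so the example runs offline; for a real page, use
    # fetch_article("https://example.com/some-article") instead.
    demo = SimpleNamespace(title="Example headline",
                           maintext="Body text of the article.")
    print(summarize(demo))
```

Keeping the network call in `fetch_article` separate from the formatting in `summarize` means the formatting logic can be tested without crawling a live site.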


Related Publications

[2017] news-please: A Generic News Crawler and Extractor

F. Hamborg, N. Meuschke, C. Breitinger, and B. Gipp

in Proceedings of the 15th International Symposium on Information Science, 2017