linkrot

Introduction

linkrot scans PDFs for links written in plaintext and checks whether they are active or return an error code. It then generates a report of its findings. It can also extract references (PDF, URL, DOI, arXiv) and metadata from a PDF.

New in v5.2.2: Retraction checking! linkrot now automatically checks DOIs against retraction databases to identify potentially retracted papers, helping ensure research integrity.

Check out our sister project, Rotting Research, for a web app implementation of this project.

Features

  • Extracts references and metadata from a given PDF.
  • Detects PDF, URL, arXiv, and DOI references.
  • Checks DOIs for retracted papers (using the -r flag).
  • Archives valid links using the Internet Archive's Wayback Machine (using the -a flag).
  • Checks for valid SSL certificates.
  • Finds broken hyperlinks (using the -c flag).
  • Outputs as text or JSON (using the -j flag).
  • Extracts the PDF text (using the --text flag).
  • Can be used as a command-line tool or Python package.
  • Works with local and online PDFs.

Installation

PyPI (Recommended)

Grab a copy of the code with pip:

pip install linkrot

Debian/Ubuntu Package

For Debian/Ubuntu systems, you can build and install a .deb package:

# Install build dependencies
sudo apt-get install dpkg-dev debhelper dh-python python3-setuptools

# Build the package
python3 setup-deb-build.py
./build-deb.sh

# Install the packages
sudo dpkg -i ../python3-linkrot_*.deb ../linkrot_*.deb
sudo apt-get install -f  # Fix any dependency issues

See debian/README.md for detailed packaging instructions.

Usage

linkrot can be used to extract info from a PDF in two ways:

  • Command line/Terminal tool linkrot
  • Python library import linkrot

1. Command Line/Terminal tool

linkrot [pdf-file-or-url]

Run linkrot -h to see the help output:

linkrot -h

usage:

linkrot [-h] [-d OUTPUT_DIRECTORY] [-c] [-r] [-j] [-v] [-t] [-a] [-o OUTPUT_FILE] [--version] pdf

Extract metadata and references from a PDF, and optionally download all referenced PDFs.

Arguments

positional arguments:

pdf (Filename or URL of a PDF file)

optional arguments:

-h, --help              (Show this help message and exit)
-d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
                        (Download all referenced PDFs into the specified directory)
-c, --check-links       (Check for broken links)
-r, --check-retractions (Check DOIs for retracted papers)
-j, --json              (Output info as JSON (instead of plain text))
-v, --verbose           (Print all references (instead of only PDFs))
-t, --text              (Only extract text (no metadata or references))
-a, --archive           (Archive active links)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
                        (Output to the specified file instead of the console)
--version               (Show the program's version number and exit)

PDF Samples

For testing purposes, you can find PDF samples in a shared MEGA folder: https://mega.nz/folder/uwBxVSzS#lpBtSz49E9dqHtmrQwp0Ig

Examples

Extract text to console

linkrot https://example.com/example.pdf -t

Extract text to file

linkrot https://example.com/example.pdf -t -o pdf-text.txt

Check Links

linkrot https://example.com/example.pdf -c

Check for Retracted Papers

linkrot https://example.com/example.pdf -r

Check Both Links and Retractions

linkrot https://example.com/example.pdf -c -r

Get Results as JSON with Retraction Check

linkrot https://example.com/example.pdf -r -j

2. Main Python Library

Import the library:

import linkrot

Create an instance of the linkrot class like so:

pdf = linkrot.linkrot("filename-or-url.pdf") #pdf is the instance of the linkrot class

Now the following functions can be used to extract specific data from the PDF:

get_metadata()

Arguments: None

Usage:

metadata = pdf.get_metadata() #pdf is the instance of the linkrot class

Return type: Dictionary <class 'dict'>

Information Provided: All metadata associated with the PDF, including hidden metadata such as the creation date, creator, and title.
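
For example, a minimal sketch (assuming a local file named "sample.pdf" as a placeholder) that prints every metadata field:

import linkrot

pdf = linkrot.linkrot("sample.pdf")  # "sample.pdf" is a placeholder filename
metadata = pdf.get_metadata()

# metadata is a plain dict, so standard dict operations apply
for key, value in metadata.items():
    print(f"{key}: {value}")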

get_text()

Arguments: None

Usage:

text = pdf.get_text() #pdf is the instance of the linkrot class

Return type: String <class 'str'>

Information Provided: The entire content of the PDF in string form.
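
For instance, a quick way to preview the extracted text (a minimal sketch, assuming the pdf instance from above):

text = pdf.get_text()
print(text[:500])  # print only the first 500 characters as a preview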

get_references(reftype=None, sort=False)

Arguments:

reftype: The type of reference that is needed.
         values: 'pdf', 'url', 'doi', 'arxiv'
         default: provides all reference types

sort: Whether the references should be sorted.
      values: True or False
      default: not sorted

Usage:

references_list = pdf.get_references() #pdf is the instance of the linkrot class

Return type: Set <class 'set'> of <linkrot.backends.Reference object>

linkrot.backends.Reference objects have 3 member variables:
- ref: the actual URL/PDF/DOI/arXiv reference
- reftype: the type of reference
- page: the page on which it was referenced

Information Provided: All references with their corresponding type and page number.
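
Since the result is a set of Reference objects, you can iterate over it and read the three member variables directly. A minimal sketch, assuming the pdf instance from above (the reftype filter is optional):

# Collect only URL references
url_refs = pdf.get_references(reftype="url")

for ref in url_refs:
    # Each Reference exposes ref, reftype, and page
    print(f"page {ref.page}: [{ref.reftype}] {ref.ref}")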

get_references_as_dict(reftype=None, sort=False)

Arguments:

reftype: The type of reference that is needed.
         values: 'pdf', 'url', 'doi', 'arxiv'
         default: provides all reference types

sort: Whether the references should be sorted.
      values: True or False
      default: not sorted

Usage:

references_dict = pdf.get_references_as_dict() #pdf is the instance of the linkrot class

Return type: Dictionary <class 'dict'> with keys 'pdf', 'url', 'doi', 'arxiv' that each have a list <class 'list'> of refs of that type.

Information Provided: All references in their corresponding type list.
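
A minimal sketch that counts the references of each type, assuming the pdf instance from above:

references_dict = pdf.get_references_as_dict()

# Keys are 'pdf', 'url', 'doi', and 'arxiv'; each value is a list
for reftype, refs in references_dict.items():
    print(f"{reftype}: {len(refs)} reference(s)")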

download_pdfs(target_dir)

Arguments:

target_dir: The path of the directory to which the referenced PDFs should be downloaded

Usage:

pdf.download_pdfs("target-directory") #pdf is the instance of the linkrot class

Return type: None

Information Provided: Downloads all the referenced PDFs to the specified directory.
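
A minimal sketch, assuming the pdf instance from above ("referenced-pdfs" is an arbitrary directory name; creating it first is only a precaution, since the source does not state whether download_pdfs creates missing directories):

import os

target_dir = "referenced-pdfs"  # hypothetical directory name
os.makedirs(target_dir, exist_ok=True)  # precaution: make sure the directory exists

pdf.download_pdfs(target_dir)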

3. Linkrot downloader functions

Import:

from linkrot.downloader import sanitize_url, get_status_code, check_refs

sanitize_url(url)

Arguments:

url: The URL to be sanitized.

Usage:

new_url = sanitize_url(old_url) 

Return type: String <class 'str'>

Information Provided: Returns the URL prefixed with 'http://' if it did not already have a scheme, and ensures it is encoded in UTF-8.

get_status_code(url)

Arguments:

url: The URL to be checked for its status.

Usage:

status_code = get_status_code(url) 

Return type: String <class 'str'>

Information Provided: The HTTP status code of the URL, which indicates whether the link is active or broken.

check_refs(refs, verbose=True, max_threads=MAX_THREADS_DEFAULT)

Arguments:

refs: a set of linkrot.backends.Reference objects
verbose: whether to print every reference with its status code or only the summary
max_threads: the number of threads to use for multithreading

Usage:

check_refs(pdf.get_references()) #pdf is the instance of the linkrot class

Return type: None

Information Provided: Prints each reference with its status code and a summary of all the broken/active links to the terminal.
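
Putting the three functions together, a minimal sketch (the filename is a placeholder, and "200" is just an example of what an active link might return):

import linkrot
from linkrot.downloader import sanitize_url, get_status_code, check_refs

pdf = linkrot.linkrot("sample.pdf")  # placeholder filename

# Normalize a bare URL before checking it
url = sanitize_url("example.com")  # prefixes "http://" if the scheme is missing
print(get_status_code(url))        # e.g. "200" for an active link

# Check every reference in the PDF and print a summary
check_refs(pdf.get_references(), verbose=False)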

4. Linkrot extractor functions

Import:

from linkrot.extractor import extract_urls, extract_doi, extract_arxiv

Get pdf text:

text = pdf.get_text() #pdf is the instance of the linkrot class

extract_urls(text)

Arguments:

text: String of text to extract URLs from

Usage:

urls = extract_urls(text)

Return type: Set <class 'set'> of URLs

Information Provided: All URLs in the text

extract_arxiv(text)

Arguments:

text: String of text to extract arXiv IDs from

Usage:

arxiv = extract_arxiv(text)

Return type: Set <class 'set'> of arXiv IDs

Information Provided: All arXiv IDs in the text

extract_doi(text)

Arguments:

text: String of text to extract DOIs from

Usage:

doi = extract_doi(text)

Return type: Set <class 'set'> of DOIs

Information Provided: All DOIs in the text
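
The three extractors can be combined on the same text. A minimal sketch, assuming the pdf instance from above:

from linkrot.extractor import extract_urls, extract_doi, extract_arxiv

text = pdf.get_text()

urls = extract_urls(text)
dois = extract_doi(text)
arxiv_ids = extract_arxiv(text)

print(f"{len(urls)} URLs, {len(dois)} DOIs, {len(arxiv_ids)} arXiv IDs")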

5. Linkrot retraction functions

Import:

from linkrot.retraction import check_dois_for_retractions, RetractionChecker

check_dois_for_retractions(dois, verbose=False)

Arguments:

dois: Set of DOI strings to check for retractions
verbose: Whether to print detailed results

Usage:

# Get DOIs from PDF text
text = pdf.get_text()
dois = extract_doi(text)

# Check for retractions
result = check_dois_for_retractions(dois, verbose=True)

Return type: Dictionary <class 'dict'> with retraction results and a summary

Information Provided: Checks each DOI against retraction databases and provides detailed information about any retracted papers found.

RetractionChecker class

For more advanced usage, you can use the RetractionChecker class directly:

checker = RetractionChecker()

# Check individual DOI
result = checker.check_doi("10.1000/182")

# Check multiple DOIs
results = checker.check_multiple_dois({"10.1000/182", "10.1038/nature12373"})

# Get summary
summary = checker.get_retraction_summary(results)

The retraction checker uses multiple methods to detect retractions:

  • CrossRef API for retraction notices in metadata
  • Analysis of DOI landing pages for retraction indicators
  • Extensible design for adding more retraction databases

Code of Conduct

To view our code of conduct, please visit our Code of Conduct page.

License

This program is licensed under the GPLv3 License.

