This repository contains a dataset of 100 Wikipedia articles, which were redacted to remove personal identifiers. The dataset contains two levels of redaction:
- H0: Direct identifiers such as name, email address, or phone number, each of which can identify a person on its own
- H1: Auxiliary information such as date of birth, employer, or nationality, which can identify a person when combined
The data is in JSON format, with 3 keys for each document:
- text: A list of the words that make up the document. The original text can be reconstructed by joining these words with a space character
- H0: A list of values (0 or 1) that indicates, for each word, whether it is part of redaction level H0
- H1: A list of values (0 or 1) that indicates, for each word, whether it is part of redaction level H1
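As a minimal sketch of how one document might be consumed, the snippet below rebuilds the text and masks the flagged words. The sample document, the `reconstruct` and `redact` helper names, and the `[REDACTED]` placeholder are assumptions made for illustration; only the keys `text`, `H0`, and `H1` come from the format described above. How documents are arranged across files in the repository is not covered here.

```python
import json

# Hypothetical sample document; the words and flags are invented for
# illustration and are not taken from the dataset.
sample = json.loads("""
{
  "text": ["Jane", "Doe", "was", "born", "in", "1970", "in", "Austria"],
  "H0":   [1, 1, 0, 0, 0, 0, 0, 0],
  "H1":   [0, 0, 0, 0, 0, 1, 0, 1]
}
""")


def reconstruct(doc):
    """Rebuild the original text by joining the words with spaces."""
    return " ".join(doc["text"])


def redact(doc, levels=("H0",), mask="[REDACTED]"):
    """Replace every word flagged in any of the given levels with a mask."""
    words = []
    for i, word in enumerate(doc["text"]):
        flagged = any(doc[level][i] == 1 for level in levels)
        words.append(mask if flagged else word)
    return " ".join(words)


original = reconstruct(sample)
h0_only = redact(sample)
h0_and_h1 = redact(sample, levels=("H0", "H1"))

print(original)
print(h0_only)
print(h0_and_h1)
```

Passing both levels masks direct identifiers and auxiliary information together, while the default masks only the H0 identifiers.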
The tool that was built to create this dataset supports a collaborative approach to data annotation. It can be used to redact example texts and then train a classifier on this data. The output of this classifier can then be tested interactively on new texts.
The application can be run out of the box with Docker: `docker-compose up`