openRedact - Automatic text redaction, powered by ML

Dataset (`documents_100_annotated.json`)

This repository contains a dataset of 100 Wikipedia articles, which were redacted to remove personal identifiers. The dataset contains two levels of redaction:

H0: Direct identifiers such as name, e-mail address, or phone number, each of which can identify a person on its own
H1: Auxiliary information such as date of birth, employer, or nationality, which can identify a person when combined

Data Format

The data is in JSON format, with 3 keys for each document:

text: A list of the words that make up the document. The original text can be reconstructed by joining these words with a space character
H0: A list of values (0 or 1), one per word, indicating whether that word is redacted at level H0
H1: A list of values (0 or 1), one per word, indicating whether that word is redacted at level H1
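
The following is a minimal Python sketch of how the three keys described above could be used to rebuild a document and mask its redacted words. It is not part of the repository: the function names `load_documents`, `reconstruct_text`, and `redact`, the `[REDACTED]` placeholder, and the sample document are illustrative assumptions. It also assumes that the top level of the JSON file is a list of document objects, which this README does not state.

```python
import json


def load_documents(path):
    # Assumption: the file contains a JSON list of document objects,
    # each with the keys "text", "H0" and "H1".
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def reconstruct_text(doc):
    # The original text is the word list joined with single spaces.
    return " ".join(doc["text"])


def redact(doc, level="H0", mask="[REDACTED]"):
    # Replace every word flagged at the given level with a mask.
    # Design choice of this sketch: a run of consecutive flagged words
    # collapses into a single mask.
    words, flags = doc["text"], doc[level]
    if len(words) != len(flags):
        raise ValueError("expected one flag per word")
    out = []
    previous_flagged = False
    for word, flag in zip(words, flags):
        if flag:
            if not previous_flagged:
                out.append(mask)
        else:
            out.append(word)
        previous_flagged = bool(flag)
    return " ".join(out)


# Hypothetical document in the described format (not taken from the dataset).
sample = {
    "text": ["Ada", "Lovelace", "was", "born", "in", "1815", "."],
    "H0": [1, 1, 0, 0, 0, 0, 0],
    "H1": [0, 0, 0, 0, 0, 1, 0],
}

print(reconstruct_text(sample))
print(redact(sample, "H0"))
print(redact(sample, "H1"))
```

With the real file, `load_documents("documents_100_annotated.json")` would take the place of the hand-written sample, provided the top-level structure matches the assumption above. The sketch masks each level separately and does not assume that H1 includes the words flagged at H0.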

Redaction and Annotation Tool

The tool built to create this dataset supports a collaborative approach to data annotation. It can be used to redact example texts and then train a classifier on the resulting annotations. The output of this classifier can then be tested interactively on new texts.

Demo: https://openredact.jakobkoehler.de/

The application can be run out of the box with Docker: `docker-compose up`
