Implement Internet Archive Data Fetching Pipeline #196

@jessbryte

Description

Problem

We currently lack a mechanism for systematically and regularly harvesting license-aware metadata from the Internet Archive (IA), a massive source of CC-licensed content. This data is essential for understanding the large-scale adoption and usage of Creative Commons licenses on the platform.

Proposal / Solution

Implement a new Python script and pipeline that queries the Internet Archive's API, processes the results, and generates aggregated statistics.

Scope of Work (What this pipeline will do)

  1. Query: Search the IA for all items with a Creative Commons or open-source license.
  2. Data Fetching: Implement robust API calling with exponential backoff and retry logic to handle network errors and rate limiting across large result sets (see the fetching sketch after this list).
  3. Normalization: Clean and standardize the messy raw license URLs provided by the IA using a lookup table (ia_license_mapping.csv), as sketched below.
  4. Aggregation: Collect counts for licenses, languages, and countries.
  5. Output: Save the aggregated counts into three separate, version-controlled CSV files (see the aggregation sketch below).
  6. Automation: Include optional flags (--enable-save, --enable-git) to support dry runs and automated Git commits/pushes for recurring data updates (see the CLI sketch below).
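
To make items 1–2 concrete, here is a rough sketch of the fetch loop. It assumes the `advancedsearch.php` endpoint, a `licenseurl`-based query, and the `requests` library; the exact query string, field list (the `country` field in particular), paging strategy, and retry limits would be settled during implementation.

```python
import time
import requests

SEARCH_URL = "https://archive.org/advancedsearch.php"  # IA advanced search endpoint

def fetch_page(query, page, rows=1000, max_retries=5):
    """Fetch one page of IA search results, retrying with exponential backoff."""
    params = {
        "q": query,  # e.g. 'licenseurl:(*creativecommons.org*)' -- query string is an assumption
        "fl[]": ["identifier", "licenseurl", "language", "country"],  # assumed field list
        "rows": rows,
        "page": page,
        "output": "json",
    }
    delay = 1
    for attempt in range(max_retries):
        try:
            resp = requests.get(SEARCH_URL, params=params, timeout=30)
            resp.raise_for_status()
            return resp.json()["response"]["docs"]
        except (requests.RequestException, KeyError, ValueError):
            if attempt == max_retries - 1:
                raise  # give up after the final retry
            time.sleep(delay)  # back off before retrying
            delay *= 2         # exponential backoff: 1s, 2s, 4s, ...
    return []
```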
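
For item 3, one possible shape for the lookup-based normalization. The column names `raw_url` and `normalized_license` are placeholders for whatever headers ia_license_mapping.csv actually uses.

```python
import csv

def _canonical_key(url):
    """Lower-case, trim, and drop scheme/trailing slash so URL variants collide."""
    return (url.strip().lower()
               .replace("https://", "")
               .replace("http://", "")
               .rstrip("/"))

def load_license_mapping(path="ia_license_mapping.csv"):
    """Load the raw-URL -> canonical-license lookup table from the mapping CSV."""
    mapping = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Column names are placeholders for the real file's headers.
            mapping[_canonical_key(row["raw_url"])] = row["normalized_license"]
    return mapping

def normalize_license(raw_url, mapping):
    """Map a messy IA license URL onto a canonical license name, or 'Unknown'."""
    if not raw_url:
        return "Unknown"
    return mapping.get(_canonical_key(raw_url), "Unknown")
```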
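
For items 4–5, aggregation and CSV output could look roughly like this (reusing `normalize_license` from the previous sketch); the output file names and column headers are illustrative only.

```python
import csv
from collections import Counter

def aggregate(docs, mapping):
    """Count licenses, languages, and countries across fetched IA documents."""
    licenses, languages, countries = Counter(), Counter(), Counter()
    for doc in docs:
        licenses[normalize_license(doc.get("licenseurl", ""), mapping)] += 1
        languages[str(doc.get("language", "Unknown"))] += 1
        countries[str(doc.get("country", "Unknown"))] += 1
    return licenses, languages, countries

def write_counts(counter, path, key_header):
    """Write one Counter to a two-column CSV (key, count), largest first."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([key_header, "count"])
        for key, count in counter.most_common():
            writer.writerow([key, count])

# Three separate, version-controlled outputs (file names are assumptions):
# write_counts(licenses, "ia_license_counts.csv", "license")
# write_counts(languages, "ia_language_counts.csv", "language")
# write_counts(countries, "ia_country_counts.csv", "country")
```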
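
For item 6, a sketch of the optional flags and the Git automation; the commit message, remote, and overall flow are illustrative, not a final design.

```python
import argparse
import subprocess

def parse_args():
    parser = argparse.ArgumentParser(
        description="Fetch CC license data from the Internet Archive")
    parser.add_argument("--enable-save", action="store_true",
                        help="write the aggregated CSV files (otherwise dry run)")
    parser.add_argument("--enable-git", action="store_true",
                        help="commit and push the updated CSV files")
    return parser.parse_args()

def commit_and_push(paths, message="Update Internet Archive license data"):
    """Stage, commit, and push the generated CSVs via the git CLI."""
    subprocess.run(["git", "add", *paths], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)
    subprocess.run(["git", "push"], check=True)

# Typical flow (file names as above):
# args = parse_args()
# if args.enable_save:
#     ...write the three CSVs...
#     if args.enable_git:
#         commit_and_push(["ia_license_counts.csv",
#                          "ia_language_counts.csv",
#                          "ia_country_counts.csv"])
```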

Expected Outcome

A baseline measurement and a recurring process for tracking CC license usage on the Internet Archive.
