Implement Internet Archive Data Fetching Pipeline #196

@jessbryte

Description

Problem

We currently lack a mechanism for systematically and regularly harvesting license-aware metadata from the Internet Archive (IA), a massive source of CC-licensed content. This data is essential for understanding the large-scale adoption and usage of Creative Commons licenses on the platform.

Proposal / Solution

Implement a new Python script and pipeline that queries the Internet Archive's API, processes the results, and generates aggregated statistics.

Scope of Work (What this pipeline will do)

  1. Query: Search the IA for all items with a Creative Commons or open-source license.
  2. Data Fetching: Implement robust API calling with exponential backoff and retry logic to handle network errors and rate limiting across large result sets (see the fetching sketch after this list).
  3. Normalization: Clean and standardize the messy raw license URLs provided by the IA using a lookup table (ia_license_mapping.csv), as sketched below.
  4. Aggregation: Collect counts for licenses, languages, and countries.
  5. Output: Save the aggregated counts into three separate, version-controlled CSV files (see the aggregation sketch below).
  6. Automation: Include optional flags (--enable-save, --enable-git) to support dry runs and automated Git commits/pushes for recurring data updates (see the CLI sketch below).
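
To make items 1–2 concrete, here is a rough sketch of the fetch loop. It assumes the `advancedsearch.php` endpoint, a `licenseurl`-based query, and the `requests` library; the exact query string, field list (the `country` field in particular), paging strategy, and retry limits would be settled during implementation.

```python
import time
import requests

SEARCH_URL = "https://archive.org/advancedsearch.php"  # IA advanced search endpoint

def fetch_page(query, page, rows=1000, max_retries=5):
    """Fetch one page of IA search results, retrying with exponential backoff."""
    params = {
        "q": query,  # e.g. 'licenseurl:(*creativecommons.org*)' -- query string is an assumption
        "fl[]": ["identifier", "licenseurl", "language", "country"],  # assumed field list
        "rows": rows,
        "page": page,
        "output": "json",
    }
    delay = 1
    for attempt in range(max_retries):
        try:
            resp = requests.get(SEARCH_URL, params=params, timeout=30)
            resp.raise_for_status()
            return resp.json()["response"]["docs"]
        except (requests.RequestException, KeyError, ValueError):
            if attempt == max_retries - 1:
                raise  # give up after the final retry
            time.sleep(delay)  # back off before retrying
            delay *= 2         # exponential backoff: 1s, 2s, 4s, ...
    return []
```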
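
For item 3, one possible shape for the lookup-based normalization. The column names `raw_url` and `normalized_license` are placeholders for whatever headers ia_license_mapping.csv actually uses.

```python
import csv

def _canonical_key(url):
    """Lower-case, trim, and drop scheme/trailing slash so URL variants collide."""
    return (url.strip().lower()
               .replace("https://", "")
               .replace("http://", "")
               .rstrip("/"))

def load_license_mapping(path="ia_license_mapping.csv"):
    """Load the raw-URL -> canonical-license lookup table from the mapping CSV."""
    mapping = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Column names are placeholders for the real file's headers.
            mapping[_canonical_key(row["raw_url"])] = row["normalized_license"]
    return mapping

def normalize_license(raw_url, mapping):
    """Map a messy IA license URL onto a canonical license name, or 'Unknown'."""
    if not raw_url:
        return "Unknown"
    return mapping.get(_canonical_key(raw_url), "Unknown")
```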
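
For items 4–5, aggregation and CSV output could look roughly like this (reusing `normalize_license` from the previous sketch); the output file names and column headers are illustrative only.

```python
import csv
from collections import Counter

def aggregate(docs, mapping):
    """Count licenses, languages, and countries across fetched IA documents."""
    licenses, languages, countries = Counter(), Counter(), Counter()
    for doc in docs:
        licenses[normalize_license(doc.get("licenseurl", ""), mapping)] += 1
        languages[str(doc.get("language", "Unknown"))] += 1
        countries[str(doc.get("country", "Unknown"))] += 1
    return licenses, languages, countries

def write_counts(counter, path, key_header):
    """Write one Counter to a two-column CSV (key, count), largest first."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([key_header, "count"])
        for key, count in counter.most_common():
            writer.writerow([key, count])

# Three separate, version-controlled outputs (file names are assumptions):
# write_counts(licenses, "ia_license_counts.csv", "license")
# write_counts(languages, "ia_language_counts.csv", "language")
# write_counts(countries, "ia_country_counts.csv", "country")
```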
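
For item 6, a sketch of the optional flags and the Git automation; the commit message, remote, and overall flow are illustrative, not a final design.

```python
import argparse
import subprocess

def parse_args():
    parser = argparse.ArgumentParser(
        description="Fetch CC license data from the Internet Archive")
    parser.add_argument("--enable-save", action="store_true",
                        help="write the aggregated CSV files (otherwise dry run)")
    parser.add_argument("--enable-git", action="store_true",
                        help="commit and push the updated CSV files")
    return parser.parse_args()

def commit_and_push(paths, message="Update Internet Archive license data"):
    """Stage, commit, and push the generated CSVs via the git CLI."""
    subprocess.run(["git", "add", *paths], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)
    subprocess.run(["git", "push"], check=True)

# Typical flow (file names as above):
# args = parse_args()
# if args.enable_save:
#     ...write the three CSVs...
#     if args.enable_git:
#         commit_and_push(["ia_license_counts.csv",
#                          "ia_language_counts.csv",
#                          "ia_country_counts.csv"])
```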

Expected Outcome

A baseline measurement and a recurring process for tracking CC license usage on the Internet Archive.
