Labels
- help wanted (Open to participation from the community)
- ✨ goal: improvement (Improvement to an existing feature)
- 🏁 status: ready for work (Ready for work)
- 💻 aspect: code (Concerns the software code in the repository)
- 🟩 priority: low (Low priority and doesn't need to be rushed)
Description
Problem
We currently lack a mechanism for systematically and regularly harvesting license-aware metadata from the Internet Archive (IA), a massive source of CC-licensed content. This data is essential for understanding the large-scale adoption and usage of Creative Commons licenses on the platform.
Proposal / Solution
Implement a new Python script and pipeline specifically designed to query the Internet Archive's API, process the results, and generate aggregated statistics.
Scope of Work (What this pipeline will do)
- Query: Search the IA for all items with a Creative Commons/open-source license.
- Data Fetching: Implement robust API calling with exponential backoff and retry logic to handle network and rate-limiting issues across large result sets.
- Normalization: Clean and standardize the messy raw license URLs provided by the IA using a lookup table (ia_license_mapping.csv).
- Aggregation: Collect counts for licenses, languages, and countries.
- Output: Save the aggregated counts into three separate, version-controlled CSV files.
- Automation: Include optional flags (--enable-save, --enable-git) to support dry runs and automated Git commits/pushes for regular data updates.
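The Data Fetching item could be sketched roughly as follows. This is a minimal illustration, not the final implementation: `call_with_backoff`, its parameters, and the 10% jitter are assumptions made for this example, and `fetch` stands in for whatever function ends up issuing a single Internet Archive request.

```python
import random
import time


def call_with_backoff(fetch, max_retries=5, base_delay=1.0, max_delay=60.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fetch() and retry failures with exponential backoff plus jitter.

    fetch is any zero-argument callable that raises on failure
    (hypothetical: a wrapper around one API request). A real pipeline
    would likely narrow retry_on to network and rate-limit errors.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay doubles on each attempt (1s, 2s, 4s, ...), capped at max_delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter so parallel clients do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting, which may be useful for dry runs.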
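For the Normalization and Aggregation items, one possible shape is sketched below. The column names `raw_url` and `normalized`, the sample rows, the item field names (`licenseurl`, `language`, `country`), and the `UNKNOWN` bucket are all assumptions for illustration; the actual layout of ia_license_mapping.csv is not defined in this issue.

```python
import csv
import io
from collections import Counter

# Hypothetical sample of ia_license_mapping.csv (real columns/rows may differ)
SAMPLE_MAPPING_CSV = """raw_url,normalized
http://creativecommons.org/licenses/by/4.0/,CC BY 4.0
https://creativecommons.org/licenses/by/4.0,CC BY 4.0
http://creativecommons.org/publicdomain/zero/1.0/,CC0 1.0
"""


def _key(url):
    """Reduce a raw URL to a lookup key: lowercase, no scheme, no trailing slash."""
    url = url.strip().lower()
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    return url.rstrip("/")


def load_mapping(csv_text):
    """Build {lookup key: normalized license name} from the mapping CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {_key(row["raw_url"]): row["normalized"] for row in reader}


def aggregate(items, mapping):
    """Count licenses, languages, and countries across item metadata dicts."""
    licenses, languages, countries = Counter(), Counter(), Counter()
    for item in items:
        key = _key(item.get("licenseurl") or "")
        # Unmapped or missing values are counted rather than silently dropped
        licenses[mapping.get(key, "UNKNOWN")] += 1
        languages[item.get("language") or "UNKNOWN"] += 1
        countries[item.get("country") or "UNKNOWN"] += 1
    return licenses, languages, countries
```

Keeping an explicit `UNKNOWN` count makes it easier to spot raw URLs that still need an entry in the lookup table.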
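The Automation flags might be wired up along these lines. Only the flag names `--enable-save` and `--enable-git` come from this issue; the help text and the rule that `--enable-git` requires `--enable-save` are assumptions for the sketch.

```python
import argparse


def parse_args(argv=None):
    """Parse the optional flags; with no flags the script is a dry run."""
    parser = argparse.ArgumentParser(
        description="Fetch CC license statistics from the Internet Archive"
    )
    parser.add_argument(
        "--enable-save",
        action="store_true",
        help="write the aggregated CSV files (default: dry run)",
    )
    parser.add_argument(
        "--enable-git",
        action="store_true",
        help="commit and push the saved files",
    )
    args = parser.parse_args(argv)
    # Assumption: there is nothing to commit unless files were saved
    if args.enable_git and not args.enable_save:
        parser.error("--enable-git requires --enable-save")
    return args
```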
Expected Outcome
A baseline measurement of CC usage on the Internet Archive, plus a recurring process for keeping it up to date.
Metadata
Projects
Status
Backlog