Added Internet Archive Fetching Pipeline #195

jessbryte · 2025-10-14T22:26:03Z

Fixes

Fixes Implement Internet Archive Data Fetching Pipeline #196 by @jessbryte

Description

Large-Scale Robust Metadata Fetching: Queries the Internet Archive for items mentioning Creative Commons. Implements retry logic with exponential backoff to ensure resilient API calls against network or rate-limit errors.
License Normalization: Cleans up messy license URLs (e.g., stripping suffixes, enforcing canonical paths) using a pre-defined mapping from ia_license_mapping.csv. Logs any unmapped URLs for review.
Aggregated Output: Processes the collected data and saves the aggregated counts for Licenses, Languages, and Countries into three separate CSV files.
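The retry behavior mentioned above can be illustrated with a minimal sketch. The PR's actual retry code is not shown in this description, so `call_with_backoff`, `max_retries`, `base_delay`, and the injectable `sleep` parameter are all hypothetical names chosen for illustration.

```python
import time


def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on any exception with exponentially growing delays.

    Hypothetical helper: the names and defaults are assumptions, not the
    PR's implementation.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # The delay doubles on each attempt: base, 2*base, 4*base, ...
            sleep(base_delay * (2 ** attempt))
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting. A production version would likely catch only network and rate-limit errors rather than every exception.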

Technical details

Core Logic

The query_internet_archive function is the core fetching mechanism:

Query: Uses text:creativecommons.org and requests minimal fields for lightweight processing.
Pagination: Fetches data in large chunks (up to 100,000 rows at a time) using a continuous while True loop until no more results are available.
Processing: Extracts licenseurl, language, and country for each item. It normalizes the license URL via normalize_license and increments three Counter objects to track aggregated statistics.
Flow: The main() function drives the process: it parses arguments and then runs the data query, which is quite time-consuming.
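The pagination and counting steps above can be sketched roughly as follows. This is not the code from the PR: `aggregate_pages` and `fetch_page` are hypothetical names, `fetch_page(page, rows)` stands in for the real Internet Archive request, and each field is assumed to hold a single string value.

```python
from collections import Counter


def aggregate_pages(fetch_page, rows=100_000, normalize_license=lambda url: url):
    """Page through results and tally license, language, and country.

    fetch_page(page, rows) is a hypothetical callable returning a list of
    item dicts, and an empty list once the results are exhausted.
    """
    license_counter = Counter()
    language_counter = Counter()
    country_counter = Counter()
    page = 1
    while True:
        items = fetch_page(page, rows)
        if not items:
            break  # no more results
        for item in items:
            raw_url = item.get("licenseurl")
            # Missing fields fall back to a placeholder instead of raising
            label = normalize_license(raw_url) if raw_url else "UNKNOWN"
            license_counter[label] += 1
            language_counter[item.get("language") or "UNKNOWN"] += 1
            country_counter[item.get("country") or "UNKNOWN"] += 1
        page += 1
    return license_counter, language_counter, country_counter
```

Keeping the fetch step behind a callable lets the loop be checked against canned pages before it is pointed at the live API.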

Tests

Run the fetch script with --enable-save.
Verify that the resulting CSV files are created and contain normalized license labels.
Check logs for accurate item counts and any warnings about unmapped license URLs.

Confirm the script handles malformed or missing license fields without crashing.

Screenshots

Checklist

I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
My pull request doesn't include code or content generated with AI.
My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main or master).
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

TimidRobot

Pull requests (PRs) can't be accepted until the instructions in the description are followed. Please edit the description and follow the instructions in the )

Please run the Static analysis tools.

scripts/1-fetch/internetarchive_fetch.py

Updated Internet Archive Fetching Pipeline

jessbryte · 2025-10-16T13:03:32Z

@TimidRobot I have made the requested changes and run static analysis. The code is ready for review.

TimidRobot

Please note that the jurisdiction of ported licenses is not necessarily representative of the country of a given work. It should not be used to indicate country.

data/ia_license_mapping.csv

scripts/1-fetch/internetarchive_fetch.py

Pipfile

Pipfile.lock

… normalize license

Internet archive update

jessbryte · 2025-10-20T12:20:02Z

@TimidRobot I have made the requested changes and the code is ready for review.

TimidRobot

Please update to import ArchiveSession, configure it, and then use its search_items method.

You should be able to configure ArchiveSession in the same way that requests.Session is in:

quantifying/scripts/1-fetch/github_fetch.py

Lines 92 to 105 in 9aa5a8f

    
    def get_requests_session():
        max_retries = Retry(
            total=5,
            backoff_factor=10,
            status_forcelist=GITHUB_RETRY_STATUS_FORCELIST,
        )
        session = requests.Session()
        session.mount("https://", HTTPAdapter(max_retries=max_retries))
        headers = {"accept": "application/vnd.github+json"}
        if GH_TOKEN:
            headers["authorization"] = f"Bearer {GH_TOKEN}"
        session.headers.update(headers)
        return session

Please put STATUS_FORCELIST and USER_AGENT in shared.py. For example:

quantifying/scripts/shared.py

Lines 10 to 22 in fda007c

    
    STATUS_FORCELIST = [
        408,  # Request Timeout
        422,  # Unprocessable Content (Validation failed,endpoint spammed, etc.)
        429,  # Too Many Requests
        500,  # Internal Server Error
        502,  # Bad Gateway
        503,  # Service Unavailable
        504,  # Gateway Timeout
    ]
    USER_AGENT = (
        "QuantifyingTheCommons/1.0 "
        "(https://github.com/creativecommons/quantifying)"
    )

This looks like a difficult data source because of its lack of normalization. Please work towards reducing the UNKNOWNs. Also, I would output them as errors and not save them into the data (or combine them into a single item if they can't be resolved).
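One way to act on this request, sketched under assumptions: `tally_licenses`, the `mapping` dict, and the logger name are hypothetical and not taken from the script under review. Unmapped URLs are kept out of the saved counts and reported as errors instead.

```python
import logging
from collections import Counter

LOGGER = logging.getLogger("ia_fetch_sketch")  # hypothetical logger name


def tally_licenses(license_urls, mapping):
    """Count mapped licenses; collect unmapped URLs separately.

    mapping is a hypothetical dict of normalized URL -> license label.
    """
    counts = Counter()
    unmapped = Counter()
    for url in license_urls:
        label = mapping.get(url)
        if label is None:
            unmapped[url] += 1  # kept out of the saved data
            continue
        counts[label] += 1
    # Report each unresolved URL once, with how often it occurred
    for url, n in unmapped.most_common():
        LOGGER.error("Unmapped license URL (%d items): %s", n, url)
    return counts, unmapped
```

Returning the unmapped counter alongside the counts makes it straightforward to review which URLs still need an entry in the mapping file.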

TimidRobot · 2025-10-20T12:58:03Z

scripts/1-fetch/internetarchive_fetch.py

Please make this script executable.

References:

https://github.com/creativecommons/quantifying#running-the-scripts

https://opensource.creativecommons.org/contributing-code/foundational-tech/#file-permissions

scripts/1-fetch/internetarchive_fetch.py

Pipfile

jessbryte · 2025-10-24T04:25:26Z

I apologize. Something came up and I was a bit occupied in the past few days. I am on this now.

- Normalize localized license URLs (e.g. /deed.de → base URL)
- Add fallback resolution for ISO 639-2/B codes and locale tags in language normalization
- Skip counting or saving results with UNKNOWN license or language
- Log unmapped values as errors
- Track unmapped license URLs and languages for diagnostics

## Summary of Changes

### Code Updates

- **Removed port/jurisdiction counting**
- **Dropped license url column on csv output files**
- **Create session upon searching internet archive**
- **Improved `normalize_license()`**:
  - Strips localized `/deed.xx` and `/legalcode.xx` suffixes
  - Normalizes scheme, host, and path for consistent mapping
- **Enhanced `normalize_language()`**:
  - Adds fallback resolution for ISO 639-2/B codes (e.g. `"ger"` → `"German"`)
  - Normalizes languages using `pycountry` library
- **Filtering Logic**:
  - Skips counting or saving results with `"UNKNOWN"` license or language
  - Logs unmapped values as `ERROR` for traceability
- **Diagnostics**:
  - Tracks `unmapped_licenseurl_counter` and `unmapped_language_counter`
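The URL cleanup described for `normalize_license()` (stripping localized `/deed.xx` and `/legalcode.xx` suffixes, then normalizing scheme, host, and path) might look roughly like the sketch below. It is an illustration, not the PR's code: `normalize_license_url` is a hypothetical name, the input is assumed to be an absolute URL, and the canonical form is assumed to be `https`, no `www.` prefix, and a trailing slash.

```python
import re
from urllib.parse import urlsplit

# Trailing "deed" or "legalcode" segment, optionally localized (e.g. deed.de)
_SUFFIX_RE = re.compile(r"/(deed|legalcode)(\.[A-Za-z_-]+)?/?$")


def normalize_license_url(url):
    """Return an assumed canonical form of an absolute license URL."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # drop the "www." prefix
    # Remove the deed/legalcode suffix, then enforce one trailing slash
    path = _SUFFIX_RE.sub("", parts.path).rstrip("/") + "/"
    return f"https://{host}{path}"
```

The normalized string can then be looked up in the license mapping, so that localized and `http` variants of the same license collapse onto one key.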

jessbryte · 2025-10-25T14:21:34Z

@TimidRobot I have made the requested changes and the code is ready for review.

Add Internet Archive data fetching functionality

24b5003

jessbryte requested review from a team as code owners October 14, 2025 22:26

jessbryte requested review from Shafiya-Heena and TimidRobot and removed request for a team October 14, 2025 22:26

cc-open-source-bot moved this to In review in TimidRobot Oct 14, 2025

cc-open-source-bot added this to TimidRobot Oct 14, 2025

TimidRobot requested changes Oct 15, 2025

TimidRobot self-assigned this Oct 15, 2025

jessbryte and others added 3 commits October 16, 2025 13:26

Merge branch 'creativecommons:main' into main

2fc1f72

Cleaned up static analysis issues in internetarchive_fetch.py

3d18a74

Merge pull request #1 from jessbryte/internet-archive-update

9cb806c

Updated Internet Archive Fetching Pipeline

jessbryte requested a review from TimidRobot October 16, 2025 13:07

TimidRobot requested changes Oct 17, 2025

jessbryte and others added 4 commits October 20, 2025 10:33

Revert Pipfile and Pipfile.lock to state before last push

b7873cc

Refactor Internet Archive fetch script: use pycountry, count by port,…

c77e23f

… normalize license

Revert Pipfile.lock

0597061

Merge pull request #2 from jessbryte/internet-archive-update

6924f4a

Internet archive update

jessbryte requested a review from TimidRobot October 20, 2025 12:18

TimidRobot requested changes Oct 20, 2025

jessbryte and others added 5 commits October 24, 2025 06:28

Revert Pipfile to remove pycountry

a22ede1

Merge branch 'creativecommons:main' into main

c47fd55

Merge branch 'main' into internet-archive-update

7c1f97a

Make internetarchive_fetch.py executable for direct script usage

903190f
