@jessbryte

@jessbryte jessbryte commented Oct 14, 2025

Fixes

Description

  1. Large-Scale Robust Metadata Fetching: Queries the Internet Archive for items mentioning Creative Commons. Implements retry logic with exponential backoff to ensure resilient API calls against network or rate-limit errors.
  2. License Normalization: Cleans up messy license URLs (e.g., stripping suffixes, enforcing canonical paths) using a pre-defined mapping from ia_license_mapping.csv. Logs any unmapped URLs for review.
  3. Aggregated Output: Processes the collected data and saves the aggregated counts for Licenses, Languages, and Countries into three separate CSV files.
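
The retry behaviour in item 1 could be sketched roughly as follows. This is a minimal illustration, not the PR's code: `call_with_backoff` and its parameters are hypothetical names, and the exception types it retries on are an assumption.

```python
import random
import time


def call_with_backoff(
    func,
    max_attempts=5,
    base_delay=1.0,
    max_delay=60.0,
    retry_on=(ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Call func(), retrying on transient errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (base, 2*base, 4*base, ...), capped
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% random jitter so clients do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable only so the helper can be exercised in tests without actually waiting.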

Technical details

Core Logic

The query_internet_archive function is the core fetching mechanism:

  1. Query: Uses text:creativecommons.org and requests minimal fields for lightweight processing.
  2. Pagination: Fetches data in large chunks (up to 100,000 rows at a time) using a continuous while True loop until no more results are available.
  3. Processing: Extracts licenseurl, language, and country for each item. It normalizes the license URL via normalize_license and increments three Counter objects to track aggregated statistics.
  4. Flow: The main() function drives the process: it parses arguments and runs the data query, which is quite time-consuming.
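
The pagination and counting steps above might look something like the sketch below. It is an assumption-laden illustration: `fetch_page` is a hypothetical stand-in for the actual Internet Archive call, `aggregate_items` and `write_counts` are invented names, and field values are assumed to be plain strings.

```python
import csv
from collections import Counter


def aggregate_items(fetch_page, normalize_license, rows=100_000):
    """Page through results and tally licenses, languages, and countries.

    fetch_page(start, rows) is a hypothetical callable that returns a list
    of item dicts, or an empty list once the results are exhausted.
    """
    license_counter = Counter()
    language_counter = Counter()
    country_counter = Counter()
    start = 0
    while True:
        items = fetch_page(start, rows)
        if not items:
            break  # no more results
        for item in items:
            # Fields may be missing, so fall back to a placeholder
            license_counter[normalize_license(item.get("licenseurl"))] += 1
            language_counter[item.get("language") or "UNKNOWN"] += 1
            country_counter[item.get("country") or "UNKNOWN"] += 1
        start += len(items)
    return license_counter, language_counter, country_counter


def write_counts(path, header, counter):
    """Save one counter as a two-column CSV, most common first."""
    with open(path, "w", newline="", encoding="utf-8") as file_obj:
        writer = csv.writer(file_obj)
        writer.writerow([header, "COUNT"])
        writer.writerows(counter.most_common())
```

Passing `normalize_license` in as a callable keeps the loop independent of how the mapping from ia_license_mapping.csv is loaded.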

Tests

  1. Run the fetch script with --enable-save.
  2. Verify that the resulting CSV files are created and contain normalized license labels.
  3. Check logs for accurate item counts and any warnings about unmapped license URLs.
  4. Confirm the script handles malformed or missing license fields without crashing.

Screenshots

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI.
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@jessbryte jessbryte requested review from a team as code owners October 14, 2025 22:26
@jessbryte jessbryte requested review from Shafiya-Heena and TimidRobot and removed request for a team October 14, 2025 22:26
@cc-open-source-bot cc-open-source-bot moved this to In review in TimidRobot Oct 14, 2025
Member

@TimidRobot TimidRobot left a comment

Pull requests (PRs) can't be accepted until the instructions in the description are followed. Please edit the description and follow the instructions in the <!-- HTML comments -->.

Please run the Static analysis tools.

@TimidRobot TimidRobot self-assigned this Oct 15, 2025
@jessbryte
Author

@TimidRobot I have made the requested changes and run static analysis. The code is ready for review.

@jessbryte jessbryte requested a review from TimidRobot October 16, 2025 13:07
Member

@TimidRobot TimidRobot left a comment

Please note that the jurisdiction of ported licenses is not necessarily representative of the country of a given work. It should not be used to indicate country.

@jessbryte jessbryte requested a review from TimidRobot October 20, 2025 12:18
@jessbryte
Author

@TimidRobot I have made the requested changes and the code is ready for review.

Member

@TimidRobot TimidRobot left a comment

Please update to import ArchiveSession, configure it, and then use its search_items method.

You should be able to configure ArchiveSession in the same way that requests.Session is in:

def get_requests_session():
    max_retries = Retry(
        total=5,
        backoff_factor=10,
        status_forcelist=GITHUB_RETRY_STATUS_FORCELIST,
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=max_retries))
    headers = {"accept": "application/vnd.github+json"}
    if GH_TOKEN:
        headers["authorization"] = f"Bearer {GH_TOKEN}"
    session.headers.update(headers)
    return session

Please put STATUS_FORCELIST and USER_AGENT in shared.py. For example:

STATUS_FORCELIST = [
    408,  # Request Timeout
    422,  # Unprocessable Content (Validation failed, endpoint spammed, etc.)
    429,  # Too Many Requests
    500,  # Internal Server Error
    502,  # Bad Gateway
    503,  # Service Unavailable
    504,  # Gateway Timeout
]
USER_AGENT = (
    "QuantifyingTheCommons/1.0 "
    "(https://github.com/creativecommons/quantifying)"
)
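
For illustration, the requested ArchiveSession setup might be sketched as below. This is an assumption-based sketch rather than the final code: `get_archive_session` and `search_creative_commons` are hypothetical names, the constants are redefined locally in place of the ones requested for shared.py, and the query string and field names are taken from the PR description. It relies on ArchiveSession (from the `internetarchive` package) being a subclass of requests.Session.

```python
# Stand-ins for the constants requested in shared.py
STATUS_FORCELIST = [408, 422, 429, 500, 502, 503, 504]
USER_AGENT = (
    "QuantifyingTheCommons/1.0 "
    "(https://github.com/creativecommons/quantifying)"
)


def get_archive_session():
    """Return an ArchiveSession with retries and a custom User-Agent."""
    # Imported here so the constants above stay usable without these
    # third-party packages installed
    from internetarchive import ArchiveSession
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    max_retries = Retry(
        total=5,
        backoff_factor=10,
        status_forcelist=STATUS_FORCELIST,
    )
    session = ArchiveSession()
    # ArchiveSession subclasses requests.Session, so mount() is available
    session.mount("https://", HTTPAdapter(max_retries=max_retries))
    session.headers.update({"User-Agent": USER_AGENT})
    return session


def search_creative_commons(session, fields=("licenseurl", "language")):
    """Yield item dicts for the query named in the PR description."""
    search = session.search_items(
        "text:creativecommons.org", fields=list(fields)
    )
    yield from search
```

A caller would then iterate with something like `for item in search_creative_commons(get_archive_session()): ...`, letting the mounted adapter handle retries instead of a hand-written loop.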

This looks like a difficult data source because of its lack of normalization. Please work towards reducing the UNKNOWNs. Also, I would output them as errors and not save them into the data (or combine them into a single item if they can't be resolved).

@jessbryte
Author

I apologize. Something came up and I was a bit occupied in the past few days. I am on this now.

jessbryte and others added 5 commits October 24, 2025 06:28
- Normalize localized license URLs (e.g. /deed.de → base URL)
- Add fallback resolution for ISO 639-2/B codes and locale tags in language normalization
- Skip counting or saving results with UNKNOWN license or language
- Log unmapped values as errors
- Track unmapped license URLs and languages for diagnostics
## Summary of Changes

### Code Updates
- **Removed port/jurisdiction counting**

- **Dropped the license URL column from the CSV output files**

- **Created the session when searching the Internet Archive**

- **Improved `normalize_license()`**:
  - Strips localized `/deed.xx` and `/legalcode.xx` suffixes
  - Normalizes scheme, host, and path for consistent mapping

- **Enhanced `normalize_language()`**:
  - Adds fallback resolution for ISO 639-2/B codes (e.g. `"ger"` → `"German"`)
  - Normalizes languages using `pycountry` library

- **Filtering Logic**:
  - Skips counting or saving results with `"UNKNOWN"` license or language
  - Logs unmapped values as `ERROR` for traceability

- **Diagnostics**:
  - Tracks `unmapped_licenseurl_counter` and `unmapped_language_counter`
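
As a rough sketch of the normalization and filtering described above (not the PR's actual implementation): the two-entry `LICENSE_MAPPING`, the `SUFFIX_RE` pattern, and the `count_licenses` helper are hypothetical, standing in for the mapping loaded from ia_license_mapping.csv.

```python
import logging
import re
from collections import Counter
from urllib.parse import urlsplit

LOGGER = logging.getLogger(__name__)

# Hypothetical excerpt; the PR loads its mapping from ia_license_mapping.csv
LICENSE_MAPPING = {
    "creativecommons.org/licenses/by/4.0": "CC BY 4.0",
    "creativecommons.org/publicdomain/zero/1.0": "CC0 1.0",
}
# Trailing /deed, /legalcode, /deed.de, /legalcode.pt_br, and similar
SUFFIX_RE = re.compile(r"/(?:deed|legalcode)(?:\.[a-z_-]+)?$")

unmapped_licenseurl_counter = Counter()


def normalize_license(url):
    """Map a raw license URL to a label, or "UNKNOWN" if unmapped."""
    if not url or not isinstance(url, str):
        return "UNKNOWN"
    raw = url.strip()
    # Add a scheme if it is missing so urlsplit() can find the host
    parts = urlsplit(raw if "://" in raw else f"https://{raw}")
    host = parts.netloc.lower().removeprefix("www.")
    path = SUFFIX_RE.sub("", parts.path.rstrip("/").lower())
    label = LICENSE_MAPPING.get(f"{host}{path}")
    if label is None:
        unmapped_licenseurl_counter[raw] += 1
        LOGGER.error("Unmapped license URL: %s", raw)
        return "UNKNOWN"
    return label


def count_licenses(items):
    """Count mapped licenses only; UNKNOWN results are skipped."""
    license_counter = Counter()
    for item in items:
        label = normalize_license(item.get("licenseurl"))
        if label == "UNKNOWN":
            continue  # already logged, not saved into the data
        license_counter[label] += 1
    return license_counter
```

Dropping the scheme and `www.` prefix from the lookup key means `http://` and `https://` variants of the same license resolve to one mapping entry.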
@jessbryte
Author

@TimidRobot I have made the requested changes and the code is ready for review.

Labels

None yet

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

Implement Internet Archive Data Fetching Pipeline

2 participants