Added Internet Archive Fetching Pipeline #195
base: main
Conversation
Pull requests (PRs) can't be accepted until the instructions in the description are followed. Please edit the description and follow the instructions in the `<!-- HTML comments -->`.
Please run the static analysis tools.
Updated Internet Archive Fetching Pipeline
I have made the requested changes and run static analysis. @TimidRobot the code is ready for review.
Please note that the jurisdiction of ported licenses is not necessarily representative of the country of a given work. It should not be used to indicate country.
@TimidRobot I have made the requested changes and the code is ready for review. |
Please update to import ArchiveSession, configure it, and then use its search_items method.
You should be able to configure ArchiveSession in the same way that requests.Session is in:
quantifying/scripts/1-fetch/github_fetch.py, lines 92 to 105 at 9aa5a8f:

```python
def get_requests_session():
    max_retries = Retry(
        total=5,
        backoff_factor=10,
        status_forcelist=GITHUB_RETRY_STATUS_FORCELIST,
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=max_retries))
    headers = {"accept": "application/vnd.github+json"}
    if GH_TOKEN:
        headers["authorization"] = f"Bearer {GH_TOKEN}"
    session.headers.update(headers)
    return session
```
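Since internetarchive's `ArchiveSession` subclasses `requests.Session`, the same mount-and-retry pattern applies to it directly. A minimal sketch, demonstrated on a plain `requests.Session` (the `STATUS_FORCELIST` and `USER_AGENT` values are copied from the `shared.py` suggestion below in this review; the helper name `configure_session` is hypothetical):

```python
from requests import Session
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Assumed to live in shared.py, per the reviewer's suggestion.
STATUS_FORCELIST = [408, 422, 429, 500, 502, 503, 504]
USER_AGENT = (
    "QuantifyingTheCommons/1.0 "
    "(https://github.com/creativecommons/quantifying)"
)


def configure_session(session: Session) -> Session:
    """Mount retry behavior and headers on any requests.Session
    subclass, including internetarchive.ArchiveSession."""
    max_retries = Retry(
        total=5,
        backoff_factor=10,
        status_forcelist=STATUS_FORCELIST,
    )
    session.mount("https://", HTTPAdapter(max_retries=max_retries))
    session.headers.update({"User-Agent": USER_AGENT})
    return session
```

With internetarchive installed, the same call would read `session = configure_session(ArchiveSession())`, after which `session.search_items(query)` uses the mounted retry adapter.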
Please put STATUS_FORCELIST and USER_AGENT in shared.py. For example, lines 10 to 22 at fda007c:

```python
STATUS_FORCELIST = [
    408,  # Request Timeout
    422,  # Unprocessable Content (validation failed, endpoint spammed, etc.)
    429,  # Too Many Requests
    500,  # Internal Server Error
    502,  # Bad Gateway
    503,  # Service Unavailable
    504,  # Gateway Timeout
]
USER_AGENT = (
    "QuantifyingTheCommons/1.0 "
    "(https://github.com/creativecommons/quantifying)"
)
```
It looks like this is a difficult data source given its lack of normalization. Please work towards reducing the UNKNOWNs. Also, I would output them as errors and not save them into the data (or combine them into a single item if they can't be resolved).
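The suggested handling (log UNKNOWNs as errors, keep them out of the saved data, and tally them for diagnostics) could be sketched as follows. The function and counter names here are illustrative, not the PR's actual identifiers:

```python
import logging
from collections import Counter

LOGGER = logging.getLogger(__name__)

# Diagnostic tally of unmapped raw values (hypothetical name).
unmapped_license_counter = Counter()


def record_result(results, license_name, raw_license_url):
    """Count a normalized license, or log and tally an unmapped one.

    UNKNOWN values are reported as errors and never written into
    the results that get saved to CSV.
    """
    if license_name == "UNKNOWN":
        LOGGER.error("Unmapped license URL: %s", raw_license_url)
        unmapped_license_counter[raw_license_url] += 1
        return  # do not save UNKNOWN rows
    results[license_name] = results.get(license_name, 0) + 1
```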
Please make this script executable.
References:
I apologize. Something came up and I was a bit occupied in the past few days. I am on this now. |
- Normalize localized license URLs (e.g. `/deed.de` → base URL)
- Add fallback resolution for ISO 639-2/B codes and locale tags in language normalization
- Skip counting or saving results with UNKNOWN license or language
- Log unmapped values as errors
- Track unmapped license URLs and languages for diagnostics
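The ISO 639-2/B and locale-tag fallbacks described above can be sketched with small lookup tables. These dicts are illustrative stand-ins for the real `pycountry` lookup, and `normalize_language` is a simplified stand-in for the PR's function:

```python
# Illustrative subsets; the actual resolution uses pycountry.
ISO_639_1 = {"de": "German", "fr": "French", "nl": "Dutch"}
# ISO 639-2/B bibliographic codes, which differ from the /T codes
# that a plain two-letter lookup covers.
ISO_639_2B = {"ger": "German", "fre": "French", "dut": "Dutch"}


def normalize_language(raw):
    """Resolve a raw language value to an English name, or UNKNOWN.

    Strips locale tags (e.g. "de-DE" -> "de") and falls back from
    ISO 639-1 to ISO 639-2/B codes.
    """
    code = raw.strip().lower().replace("_", "-").split("-")[0]
    name = ISO_639_1.get(code) or ISO_639_2B.get(code)
    return name if name else "UNKNOWN"
```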
## Summary of Changes

### Code Updates

- **Removed port/jurisdiction counting**
- **Dropped license URL column from CSV output files**
- **Create session upon searching Internet Archive**
- **Improved `normalize_license()`**:
  - Strips localized `/deed.xx` and `/legalcode.xx` suffixes
  - Normalizes scheme, host, and path for consistent mapping
- **Enhanced `normalize_language()`**:
  - Adds fallback resolution for ISO 639-2/B codes (e.g. `"ger"` → `"German"`)
  - Normalizes languages using the `pycountry` library
- **Filtering logic**:
  - Skips counting or saving results with `"UNKNOWN"` license or language
  - Logs unmapped values as `ERROR` for traceability
- **Diagnostics**:
  - Tracks `unmapped_licenseurl_counter` and `unmapped_language_counter`
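The `normalize_license()` behavior summarized above (strip localized `/deed.xx` and `/legalcode.xx` suffixes, normalize scheme/host/path) could look roughly like this stdlib-only sketch; the function name and exact rules are assumptions, not the PR's code:

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_license_url(url):
    """Normalize a CC license URL for consistent mapping.

    Forces https, lowercases, strips localized /deed.xx and
    /legalcode.xx suffixes, and ensures a trailing slash.
    """
    parts = urlsplit(url.strip().lower())
    path = parts.path
    # Drop localized deed/legalcode suffixes, e.g. /deed.de, /legalcode.fr
    path = re.sub(r"/(deed|legalcode)(\.[a-z-]+)?/?$", "/", path)
    if not path.endswith("/"):
        path += "/"
    return urlunsplit(("https", parts.netloc, path, "", ""))
```

For example, `http://creativecommons.org/licenses/by/4.0/deed.de` and `https://creativecommons.org/licenses/by/4.0/` would normalize to the same key.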
@TimidRobot I have made the requested changes and the code is ready for review. |
Fixes
Description
Technical details
Core Logic
The query_internet_archive function is the core fetching mechanism:
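A hedged sketch of what such a fetch loop might look like, assuming the session exposes a `search_items()` method as `internetarchive.ArchiveSession` does (the field names and signature here are illustrative, not the PR's actual code):

```python
def query_internet_archive(
    session, query, fields=("identifier", "licenseurl", "language")
):
    """Page through Internet Archive search results and yield one
    plain dict per item, with missing fields defaulting to ""."""
    for item in session.search_items(query, fields=list(fields)):
        yield {field: item.get(field, "") for field in fields}
```

Because the session is duck-typed, the loop can be exercised in tests with a stub that returns canned result dicts instead of hitting archive.org.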
Tests
Confirm the script handles malformed or missing license fields without crashing.
Screenshots
Checklist
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."