@jessbryte

Description

This PR improves license normalization and port jurisdiction handling. It replaces the previous CSV-based country mapping with ISO country code lookups from the pycountry package, and shifts the jurisdiction analysis from country names to port codes (e.g., /nl, /tw). Unported licenses are now explicitly marked with None.

Key Changes

  1. License Normalization:

    • Cleans and standardizes license URLs.
    • Removes suffixes like /legalcode and /deed.
    • Identifies jurisdiction codes from ported URLs.
    • Marks unported licenses with None.
  2. License-Identifier Mapping:

    • Renames ia_license_mapping.csv to data/license_url_to_identifier_mapping.csv.
  3. Jurisdiction Mapping:

    • Replaces the country mapping CSV file with a pycountry-based mapping.
    • Jurisdiction is now recorded as a port code (e.g., nl, us) or None.
  4. Data Aggregation:

    • Counts licenses by:
      • Total usage
      • Language
      • Jurisdiction port (not country)
  5. Pipfile and Pipfile.lock:

    • Reverts these files and adds the package pycountry = "*".
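
The normalization and per-port counting described above could look roughly like the sketch below. It is not the code from this PR: the URL shape (/licenses/<type>/<version>[/<port>]), the forced https scheme, the handling of language-tagged suffixes such as deed.en, and the helper count_by_port() are all assumptions made for illustration, and the real normalize_license() may return a different shape.

```python
from collections import Counter
from urllib.parse import urlparse

# Suffixes the PR description says are removed during normalization.
TRAILING_SUFFIXES = ("legalcode", "deed")


def normalize_license(url):
    """Return (normalized_url, port), where port is e.g. "nl" or None.

    Hypothetical sketch; expects a full URL including the scheme.
    """
    parsed = urlparse(url.strip().lower())
    segments = [s for s in parsed.path.split("/") if s]
    # Drop a trailing /legalcode or /deed segment (also "deed.en" style).
    if segments and segments[-1].split(".")[0] in TRAILING_SUFFIXES:
        segments.pop()
    port = None
    # Assumed ported shape: /licenses/<type>/<version>/<port>
    if len(segments) == 4 and segments[0] == "licenses":
        port = segments[3]
    normalized = "https://" + parsed.netloc + "/" + "/".join(segments) + "/"
    return normalized, port


def count_by_port(urls):
    """Count license URLs per port code; unported licenses go under None."""
    counts = Counter()
    for url in urls:
        _, port = normalize_license(url)
        counts[port] += 1
    return counts


if __name__ == "__main__":
    urls = [
        "http://creativecommons.org/licenses/by/3.0/nl/legalcode",
        "https://creativecommons.org/licenses/by-sa/4.0/deed.en",
        "https://creativecommons.org/publicdomain/zero/1.0/",
    ]
    for url in urls:
        print(normalize_license(url))
    print(count_by_port(urls))
```

Using None as the Counter key keeps unported licenses in the same tally as ported ones, which matches the description's choice to mark them explicitly rather than drop them.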

Technical Details

  • Updated normalize_license() to return a port code or None.
  • Updated headers and output CSVs to reflect PORT instead of COUNTRY.
  • Added pycountry to Pipfile and removed legacy country_mapping.csv.
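
For readers unfamiliar with pycountry, a port code can be resolved against its ISO 3166-1 data roughly as follows. This is a sketch under assumptions, not the PR's code: port_to_country_name() is a hypothetical helper, and the import is guarded only so the snippet still runs where pycountry is not installed. Port codes that are not two-letter ISO codes simply resolve to None here.

```python
try:
    import pycountry  # third-party package; this PR adds it to the Pipfile
except ImportError:  # keeps the sketch runnable without the dependency
    pycountry = None


def port_to_country_name(port):
    """Return the ISO country name for a two-letter port code, else None.

    Hypothetical helper for illustration only.
    """
    if port is None or pycountry is None or len(port) != 2:
        return None
    try:
        country = pycountry.countries.get(alpha_2=port.upper())
    except KeyError:
        # Older pycountry releases raise instead of returning None.
        country = None
    return country.name if country else None


if __name__ == "__main__":
    for port in ("nl", "tw", None):
        print(port, "->", port_to_country_name(port))
```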

Tests

  1. Run the fetch script with --enable-save.
  2. Confirm that:
    • CSV files are generated with normalized license labels.
    • Jurisdiction column reflects port codes or None.
    • Logs show accurate item counts and warnings for unmapped licenses.
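
The jurisdiction check in step 2 can also be scripted. The sketch below assumes a column header named PORT, assumes unported rows are written as either the literal string None or an empty cell, and assumes port codes are lowercase letters; none of these details are confirmed by the description, and the sample rows are made up.

```python
import csv
import io
import re

PORT_PATTERN = re.compile(r"^[a-z]{2,}$")  # assumption: lowercase letters, e.g. "nl"
UNPORTED = {"", "None"}  # assumption: how unported rows appear in the CSV


def unexpected_port_values(csv_file, column="PORT"):
    """Return (row_number, value) pairs whose port column looks wrong."""
    reader = csv.DictReader(csv_file)
    bad = []
    # Data rows start at 2 because row 1 is the header.
    for row_number, row in enumerate(reader, start=2):
        value = (row.get(column) or "").strip()
        if value not in UNPORTED and not PORT_PATTERN.match(value):
            bad.append((row_number, value))
    return bad


if __name__ == "__main__":
    # Made-up sample data; the last row still carries a country name.
    sample = io.StringIO(
        "LICENSE,PORT,COUNT\n"
        "CC BY 3.0,nl,12\n"
        "CC BY 4.0,None,40\n"
        "CC BY 2.0,Netherlands,3\n"
    )
    print(unexpected_port_values(sample))
```

To check a generated file instead of the sample, pass an open file handle, e.g. open(path, newline="").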

@jessbryte jessbryte merged commit 6924f4a into main Oct 20, 2025