@Ramses-Njasap Ramses-Njasap commented Oct 18, 2025

Fixes

Fixes #166 by @TimidRobot

Description

This pull request implements Phase 2 processing for GitHub data by adding the github_process.py script. The script reads the GitHub CC license usage data collected in Phase 1, cleans and transforms it, maps LICENSE identifiers to official Creative Commons legal tool identifiers, and generates a summary CSV file for reporting.

This processing step prepares GitHub license statistics for use in Phase 3 reporting and future quarterly comparisons.


Technical details

  • Script location: scripts/2-process/github_process.py
  • Input file: data/{year}Q{quarter}/1-fetch/github_1_count.csv
  • Output file: data/{year}Q{quarter}/2-process/github_summary.csv
  • CLI options:
    • --enable-save: writes the summary file to disk
    • --enable-git: optionally commits and pushes the generated file
  • Mappings added: LICENSE identifiers converted to CC legal tool identifiers using an inline mapping based on official license standards
  • Error handling: raises QuantifyingException if input data is missing
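
For readers who have not seen the script, the CLI options and file locations listed above could be wired up roughly as follows. This is a minimal sketch, not the code in this PR: the function names `parse_arguments` and `data_paths` and the `root` parameter are hypothetical, and only the two flags and the two path patterns come from the description.

```python
import argparse
from pathlib import Path


def parse_arguments(argv=None):
    """Parse the two flags named in the PR description (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Process GitHub data (sketch)")
    parser.add_argument(
        "--enable-save",
        action="store_true",
        help="write the summary file to disk",
    )
    parser.add_argument(
        "--enable-git",
        action="store_true",
        help="commit and push the generated file",
    )
    return parser.parse_args(argv)


def data_paths(year, quarter, root="data"):
    """Build the input and output paths from the patterns in the description.

    `root` is an assumption; the description only gives relative paths.
    """
    base = Path(root) / f"{year}Q{quarter}"
    input_file = base / "1-fetch" / "github_1_count.csv"
    output_file = base / "2-process" / "github_summary.csv"
    return input_file, output_file


if __name__ == "__main__":
    args = parse_arguments()
    in_path, out_path = data_paths(2025, 4)
    print(args.enable_save, args.enable_git, in_path, out_path)
```

Both flags default to off, so a plain run neither writes files nor touches git.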

Tests

Steps to test:

  1. Run the script:
    pipenv run python scripts/2-process/github_process.py --enable-save
    
  2. Confirm output file is created:
    data/{year}Q{quarter}/2-process/github_summary.csv
    
  3. Inspect the CSV and verify:
    • License identifiers use CC legal tool format (e.g. CC-BY-4.0)
    • Totals are correct and a TOTAL row is included
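
The manual inspection in step 3 could also be scripted. The sketch below assumes the summary has a label column and a count column; the column names `LICENSE` and `COUNT`, the helper `check_summary`, and the sample counts are all illustrative assumptions, since the PR description does not show the actual CSV header.

```python
import csv
import io


def check_summary(rows, label_col="LICENSE", count_col="COUNT"):
    """Return True if exactly one TOTAL row exists and it equals the sum
    of all other rows. Column names are assumptions, not from the PR."""
    total_rows = [row for row in rows if row[label_col] == "TOTAL"]
    if len(total_rows) != 1:
        return False
    others = sum(
        int(row[count_col]) for row in rows if row[label_col] != "TOTAL"
    )
    return int(total_rows[0][count_col]) == others


# Made-up sample data standing in for github_summary.csv
sample = io.StringIO("LICENSE,COUNT\nCC-BY-4.0,10\nCC0-1.0,5\nTOTAL,15\n")
rows = list(csv.DictReader(sample))
result = check_summary(rows)
```

To check a real file, replace the `io.StringIO` sample with an open file handle on the generated CSV.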

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI.
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or

(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or

(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.

(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.

@Ramses-Njasap Ramses-Njasap requested review from a team as code owners October 18, 2025 15:36
@Ramses-Njasap Ramses-Njasap requested review from TimidRobot and possumbilities and removed request for a team October 18, 2025 15:36
@cc-open-source-bot cc-open-source-bot moved this to In review in TimidRobot Oct 18, 2025
wordcloud = "*"

[dev-packages]
black = "*"

Hi @Ramses-Njasap

This file is not to be included in your PR.

Pipfile.lock Outdated

This file too is not to be part of your PR.

@Ramses-Njasap
Author

Ramses-Njasap commented Oct 19, 2025

Hello @Babi-B, I have made the highlighted changes. Please check them out.

@Babi-B

Babi-B commented Oct 21, 2025

Hi @Ramses-Njasap!

I noticed your PR hasn’t been reviewed yet. You could drop a message on Zulip tagging @TimidRobot to share what you’ve done and request feedback.

@Ramses-Njasap
Author

Hello @Babi-B,

Thank you for the advice. I'll tag him in the Zulip group. As of now, I can't continue with the other task until this one has been validated (the tasks are connected).

@TimidRobot
Member

@Ramses-Njasap

This pull request (PR) is unacceptable due to a failure to follow the PR template instructions.

The Checklist instructions include:

<!-- DON'T remove this section or any of the lines. -->
<!-- Leave incomplete or inapplicable lines unchecked. -->
<!-- Replace the [ ] with [x] to check the boxes (there is no space between x and square brackets). -->

The template is located here: creativecommons/.github/blob/main/.github/PULL_REQUEST_TEMPLATE.md

Pull requests without the Developer Certificate of Origin section won't be accepted 🙅🏻

@Ramses-Njasap
Author

@TimidRobot,
I have updated the PR to follow the template instructions. Could you please look into it and give me feedback while I continue with other tasks?

Member

@TimidRobot TimidRobot left a comment

The processing script should be worked on at the same time as the reporting script. There should be a 1:1 relationship between the CSV files and the plots.

Member

This file should be removed from the pull request (PR) and NOT deleted (removed from the project).

Member

This file should be removed from the pull request (PR) and NOT deleted (removed from the project).

Comment on lines +24 to +27
SPDX_TO_CC_LICENSE = {
    "CC0-1.0": "zero_1.0",
    "CC-BY-4.0": "by_4.0",
    "CC-BY-SA-4.0": "by-sa_4.0",
Member

These are not CC license identifiers

}

# Licenses outside Creative Commons are kept unchanged
NON_CC_LICENSES = {"0BSD", "MIT-0", "Unlicense", "N/A"}
Member

'N/A' stands for "not applicable", it is not a license.
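
One way to act on this comment is to treat placeholder values as missing data rather than as members of a license set. The sketch below is only an illustration of that idea, not the change requested or made in this PR; the names `PLACEHOLDERS` and `split_identifiers` are hypothetical.

```python
# Values that signal "no license information" rather than a license.
# This set is an assumption for illustration.
PLACEHOLDERS = {"N/A", ""}


def split_identifiers(values):
    """Separate real identifiers from placeholder values, preserving order."""
    licenses, missing = [], []
    for value in values:
        if value.strip() in PLACEHOLDERS:
            missing.append(value)
        else:
            licenses.append(value)
    return licenses, missing


licenses, missing = split_identifiers(["0BSD", "N/A", "MIT-0", "Unlicense"])
```

Keeping the two groups apart lets the summary report missing values separately instead of counting them as a license.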

save_summary(summary, args)


if __name__ == "__main__":
Member

This section should match:

if __name__ == "__main__":
    try:
        main()
    except shared.QuantifyingException as e:
        if e.exit_code == 0:
            LOGGER.info(e.message)
        else:
            LOGGER.error(e.message)
        sys.exit(e.exit_code)
    except SystemExit as e:
        if e.code != 0:
            LOGGER.error(f"System exit with code: {e.code}")
        sys.exit(e.code)
    except KeyboardInterrupt:
        LOGGER.info("(130) Halted via KeyboardInterrupt.")
        sys.exit(130)
    except Exception:
        traceback_formatted = textwrap.indent(
            highlight(
                traceback.format_exc(),
                PythonTracebackLexer(),
                TerminalFormatter(),
            ),
            " ",
        )
        LOGGER.critical(f"(1) Unhandled exception:\n{traceback_formatted}")
        sys.exit(1)

@TimidRobot TimidRobot self-assigned this Oct 21, 2025
Labels

None yet

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

Improve GitHub processing and reporting

3 participants