
Conversation

@Goziee-git commented Oct 11, 2025

Fixes

Description

Implements a comprehensive arXiv data collection system to quantify open access academic papers in the commons.

Type of Change

  • New feature implementing data collection of arXiv open access academic papers
  • Data source addition/modification

Changes Made

  • Added arXiv API integration for fetching academic paper metadata, per the project requirement to automate fetching of new data sources
  • Implemented a data processing pipeline for arXiv submissions in scripts/1-fetch/arxiv_fetch.py
  • Created filtering logic for open access and CC-licensed papers
  • Added arXiv data to quarterly reporting system

Testing

  • Static analysis passes (./dev/check.sh)
  • arXiv API integration tested with sample queries
  • Data processing validated with test dataset

Data Impact

  • New data source added (arXiv academic papers)
  • Report generation affected (new academic commons metrics)

Related Documentation

  • Updated sources.md with arXiv API credentials setup
  • Added arXiv processing documentation

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI.
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@Goziee-git Goziee-git requested review from a team as code owners October 11, 2025 23:29
@Goziee-git Goziee-git requested review from TimidRobot and possumbilities and removed request for a team October 11, 2025 23:29
@cc-open-source-bot cc-open-source-bot moved this to In review in TimidRobot Oct 11, 2025
@Goziee-git Goziee-git changed the title Add arXiv data fetching and processing functionality Add arXiv data fetching functionality Oct 12, 2025
@TimidRobot (Member) left a comment

This is a great start.

I recommend also developing a data/report plan. For example:

  • It is not meaningful to get a count of a single language (though it is worth noting that other languages are not available).
  • Category codes should be converted for reporting (words and/or abbreviations instead of acronyms)

@TimidRobot

This comment was marked as outdated.

@Goziee-git (Author) commented:

@Goziee-git please follow through on your first pull request (PR) before submitting any more:

Depending on how that one goes, I might reopen this PR.

Hello @TimidRobot, as requested, I have made changes based on your review. Good work is emphasized over speed, and I hope my attempt to go full circle with the other PR hasn't dented my chances of contributing significantly to the project. Thank you 🙏🏼

@TimidRobot TimidRobot reopened this Oct 17, 2025
@TimidRobot (Member) commented:

@Goziee-git ok, please focus on this PR

@Goziee-git (Author) commented Oct 20, 2025

This is a great start.

I recommend also developing a data/report plan. For example:

  • It is not meaningful to get a count of a single language (though it is worth noting that other languages are not available).
  • Category codes should be converted for reporting (words and/or abbreviations instead of acronyms)

@TimidRobot I have removed the query for languages, as it returns only English. Also worth noting: the arXiv data source accepts papers in other languages but requires that paper abstracts be submitted in English, so it is impossible to get a good distribution of licenses by language.

Also, as suggested, I converted the category codes to reporting words that are more user-friendly and readable, using an external arxiv_category_map.yml in data/2025Q4/1-fetch. I believe this should make updates reproducible and maintainable over time. The script now produces arxiv_2_count_by_category_report.csv and arxiv_2_count_by_category_report_agg.csv for better reporting. Also, instead of dumping raw author count data as before, I implemented a bucketing approach in arxiv_4_count_by_author_bucket.csv to group author counts into meaningful ranges (1, 2-3, 4-6, 7-10, 11+). The script also generates an arxiv_provenance.json to record metadata for audit, reproducibility, and provenance.
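
For illustration, a minimal sketch of the bucketing logic (the function name is mine; the actual implementation in arxiv_fetch.py may differ):

def author_count_bucket(author_count):
    # Map a raw author count to one of the reporting ranges
    if author_count is None:
        return "unknown"
    if author_count == 1:
        return "1"
    if author_count <= 3:
        return "2-3"
    if author_count <= 6:
        return "4-6"
    if author_count <= 10:
        return "7-10"
    return "11+"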

@Goziee-git (Author) commented Oct 20, 2025

Hello @TimidRobot, I observed from multiple results fetched previously that the script failed to detect CC licenses that may be recorded as hyphenated variants (CC-BY, CC-BY-NC, etc.). I have implemented a compiled regex pattern that replaces the plain string matching for more robust license detection.
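
A sketch of the idea (the exact pattern in the script may differ; this one accepts either spaces or hyphens between the parts of a tool name):

import re

# Match CC tool names written with spaces or hyphens,
# e.g. "CC BY-NC", "CC-BY-NC", "CC BY SA"
CC_LICENSE_PATTERN = re.compile(
    r"\bCC[-\s]?(BY(?:[-\s]?(?:NC|ND|SA)){0,2}|0)\b",
    re.IGNORECASE,
)


def detect_cc_license(text):
    match = CC_LICENSE_PATTERN.search(text)
    if match is None:
        return None
    # Normalize separators to hyphens, e.g. "BY NC" -> "BY-NC"
    return "CC " + re.sub(r"[\s-]+", "-", match.group(1).upper())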

I have also looked at some of the implementations in other PRs that use the normalize_license_text() function for consistent license identification.

I'd like to know your thoughts on these changes so I can keep working on further improvements. Thanks.

@TimidRobot (Member) commented:

Also, as suggested, I converted the category codes to reporting words that are more user-friendly and readable, using an external arxiv_category_map.yml in data/2025Q4/1-fetch. I believe this should make updates reproducible and maintainable over time.

  • How was arxiv_category_map.yml created?
    • If by script, it should probably go in dev/
  • Data that persists should go in data/, not a specific quarter directory

The script now produces arxiv_2_count_by_category_report.csv and arxiv_2_count_by_category_report_agg.csv for better reporting. Also, instead of dumping raw author count data as before, I implemented a bucketing approach in arxiv_4_count_by_author_bucket.csv to group author counts into meaningful ranges (1, 2-3, 4-6, 7-10, 11+).

I'll look at data after outstanding comments are resolved.

The script also generates an arxiv_provenance.json to record metadata for audit, reproducibility, and provenance.

I'll look at data after outstanding comments are resolved. That said, I'm not excited about adding JSON to the project.

@TimidRobot (Member) commented:

Hello @TimidRobot, I observed from multiple results fetched previously that the script failed to detect CC licenses that may be recorded as hyphenated variants (CC-BY, CC-BY-NC, etc.). I have implemented a compiled regex pattern that replaces the plain string matching for more robust license detection.

I have also looked at some of the implementations in other PRs that use the normalize_license_text() function for consistent license identification.

I'd like to know your thoughts on these changes so I can keep working on further improvements. Thanks.

It's probably a good idea to create a function in the shared library eventually. Please leave that to last, however.

Refactor arxiv_fetch.py to use requests library for HTTP requests, implementing retry logic for better error handling. Update license extraction logic and CSV headers to remove PLAN_INDEX.
@TimidRobot (Member) left a comment

I'm unable to test the script.

Command:

Output:

Traceback (most recent call last):
  File "/Users/timidrobot/git/creativecommons/quantifying/./scripts/1-fetch/arxiv_fetch.py", line 18, in <module>
    import feedparser
ModuleNotFoundError: No module named 'feedparser'

    return session


def normalize_license_text(raw_text: str) -> str:
@TimidRobot (Member) commented:

Why does this function have types?

@Goziee-git (Author) replied:

Why does this function have types?
@TimidRobot, the type hint in normalize_license_text() helps ensure consistency in the expected data type, since entries from the arXiv API can be returned in several data types. This helps with processing and avoids type errors. Normalizing the returned value as a string, I figured, would also help prevent type-related bugs when the returned license identifier is used in dictionaries, CSV writing, or logging. Would you prefer a different approach or implementation here?

@TimidRobot (Member) replied:

@Goziee-git I meant, "Why does this function alone have types?". It looks like you copied something without understanding it (you should never submit code you don't understand).

Type checking in Python still requires supporting tooling. For example:

File: test.py

def test_type(test: str) -> str:
    print(type(test))


test_type(1)

Command:

python3 ./test.py

Output:

<class 'int'>

Any new tooling should first be decided in an issue and then added in a dedicated pull request.

@Goziee-git (Author) replied:

@Goziee-git I meant, "Why does this function alone have types?". It looks like you copied something without understanding it (you should never submit code you don't understand).

Type checking in Python still requires supporting tooling. For example:

File: test.py

def test_type(test: str) -> str:
    print(type(test))


test_type(1)

Command:

python3 ./test.py

Output:

<class 'int'>

Any new tooling should first be decided in an issue and then added in a dedicated pull request.

@TimidRobot, I have raised issue #212 for the type-check tooling, specifically proposing the mypy Python package, and I have read through its usage. The current implementation here only shows type hints, with no enforcement. With your permission, I'd like you to review the issue and give your recommendations, since the implementation here can only become concrete when issue #212 is resolved.

@TimidRobot (Member) replied:

Please remove the type hint. It can be added back when/if type hints are supported by the project.

@Goziee-git (Author) commented Oct 22, 2025

I'm unable to test the script.

Command:

Output:

Traceback (most recent call last):
  File "/Users/timidrobot/git/creativecommons/quantifying/./scripts/1-fetch/arxiv_fetch.py", line 18, in <module>
    import feedparser
ModuleNotFoundError: No module named 'feedparser'

@TimidRobot, I ran the project as directed by the project README, which requires installing the dependencies first and then running the script from the root of the project using pipenv run. So, from an interactive shell with /Users/timidrobot/git/creativecommons/quantifying as the working directory, the command should be ./scripts/1-fetch/arxiv_fetch.py with the desired arguments.

I think the ModuleNotFoundError is from the missing feedparser module in your Pipfile. The arxiv_fetch.py script requires the feedparser module because it queries arXiv's API, which returns RSS/Atom feeds containing paper metadata. feedparser converts this XML into structured Python objects, allowing the script to extract license information, categories, publication years, and author counts from each paper entry.
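
For illustration, this is roughly what the script does with the feed (the query string here is only an example, not the script's actual query):

import feedparser

url = (
    "http://export.arxiv.org/api/query"
    "?search_query=all:electron&start=0&max_results=5"
)
feed = feedparser.parse(url)

for entry in feed.entries:
    # Atom <category> tags carry the arXiv category codes
    categories = [tag.get("term", "") for tag in entry.get("tags", [])]
    authors = [author.get("name", "") for author in entry.get("authors", [])]
    print(entry.get("title", ""), categories, len(authors))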

Without feedparser, the script can't process arXiv's API responses, so I think you have to add the feedparser dependency to your Pipfile to run the script.

@TimidRobot (Member) commented:

Without feedparser, the script can't process arXiv's API responses, so I think you have to add the feedparser dependency to your Pipfile to run the script.

@Goziee-git no, you need to include an updated Pipfile and Pipfile.lock in your pull request.

@Goziee-git (Author) commented Oct 23, 2025

Without feedparser, the script can't process arXiv's API responses, so I think you have to add the feedparser dependency to your Pipfile to run the script.

@Goziee-git no, you need to include an updated Pipfile and Pipfile.lock in your pull request.

@TimidRobot, since this will require that I add a new dependency to the Pipfile and the Pipfile.lock, would you prefer that I submit the updates for this here without raising an issue for it, or otherwise?

Comment on lines +53 to +60

FILE_ARXIV_YEAR = shared.path_join(
    PATHS["data_1-fetch"], "arxiv_3_count_by_year.csv"
)
FILE_ARXIV_AUTHOR = shared.path_join(
    PATHS["data_1-fetch"], "arxiv_4_count_by_author_count.csv"
)
FILE_ARXIV_AUTHOR_BUCKET = shared.path_join(
    PATHS["data_1-fetch"], "arxiv_4_count_by_author_bucket.csv"
@Babi-B commented:

Hi @Goziee-git!

I think you should order your constants.

@Babi-B commented:

Hi @TimidRobot!

Given how large the constant list is, would you recommend using a dictionary here?

For example (R43-R49),

FILE_ARXIV = {
    "count": shared.path_join(PATHS["data_1-fetch"], "arxiv_1_count.csv"),
    "category": shared.path_join(PATHS["data_1-fetch"], "arxiv_2_count_by_category.csv"),
    "category_report": shared.path_join(PATHS["data_1-fetch"], "arxiv_2_count_by_category_report.csv"),
    ...
}

All related constants would be grouped in the same dictionary.

@TimidRobot (Member) replied:

The benefit of a dictionary is that it can be acted on programmatically. If that is helpful, a dictionary is a good idea. If not, I tend to prefer individual constants.
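
For example, a dictionary lets all of the files be initialized in one loop (a sketch; initialize_data_file and FILE_ARXIV are assumed from elsewhere in this PR, and the header lists shown are illustrative):

HEADERS = {
    "count": ["TOOL_IDENTIFIER", "COUNT"],
    "category": ["TOOL_IDENTIFIER", "CATEGORY", "COUNT"],
    "category_report": ["TOOL_IDENTIFIER", "CATEGORY_LABEL", "COUNT"],
}

for key, path in FILE_ARXIV.items():
    initialize_data_file(path, HEADERS.get(key, []))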

@Goziee-git (Author) replied:

Hi @Goziee-git!

I think you should order your constants.

Really, it's good practice to order constants, and I do have a lot of constants. I agree with you that ordering them keeps the flow organised and conforms to proper coding practices. This script is still in development and will need all the insights I can get to make it best for long-term use. I appreciate this, @Babi-B. Thanks!

Comment on lines +24 to +25

except Exception:
    return {}
@Babi-B commented Oct 24, 2025:

Hi @Goziee-git!

I have a question about this. Why do you choose to silently swallow the exceptions? Would it not be better to log an error/warning so the user knows the mapping wasn't loaded?

@Goziee-git (Author) replied:

This script is still in development; thanks, @Babi-B, for the notes here. I have looked into it carefully and implemented changes to properly log exceptions instead of silently handling them.
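
Roughly, the change looks like this (the function and file names here are illustrative, not the exact code):

import logging

import yaml

LOGGER = logging.getLogger(__name__)


def load_category_map(path):
    # Log failures instead of silently returning an empty mapping
    try:
        with open(path, "r", encoding="utf-8") as file_obj:
            return yaml.safe_load(file_obj) or {}
    except (OSError, yaml.YAMLError) as err:
        LOGGER.warning("Could not load category map %s: %s", path, err)
        return {}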

@@ -0,0 +1,73 @@
#!/usr/bin/env python
@Babi-B commented:

Is this script meant to be imported purely as a module? If it is, I think it is not needed here

@Babi-B added:

I mean the shebang

    data_dir: Directory containing arxiv_category_map.yaml
    """
    if not os.path.exists(input_file):
        return
@Babi-B commented:

Here too...a silent return. Why not a warning or an exception? Could this not lead to confusing downstream errors?


with (
    open(input_file, "r") as infile,
    open(output_file, "w", newline="") as outfile,
@Babi-B commented:

I think you should add encoding="utf-8" to prevent platform-dependent encoding issues.

]

# Log the start of the script execution
LOGGER.info("Script execution started.")
@Babi-B commented:

You should put this in the main function so it triggers when the script actually runs.
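
Something like this (a sketch, with LOGGER as already set up in the script):

def main():
    # Fires only when the script is executed, not when it is imported
    LOGGER.info("Script execution started.")
    # ... fetch and process arXiv data ...


if __name__ == "__main__":
    main()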

def initialize_data_file(file_path, headers):
    """Initialize CSV file with headers if it doesn't exist."""
    if not os.path.isfile(file_path):
        with open(file_path, "w", newline="") as file_obj:
@Babi-B commented:

I think you should add encoding="utf-8" here too

retry_strategy = Retry(
    total=5,
    backoff_factor=1,
    status_forcelist=[408, 429, 500, 502, 503, 504],
@Babi-B commented:

There is a STATUS_FORCELIST constant in shared.py
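
For example (a sketch; this assumes STATUS_FORCELIST can be imported from the shared module):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

import shared  # project module providing STATUS_FORCELIST


def get_session():
    session = requests.Session()
    retry_strategy = Retry(
        total=5,
        backoff_factor=1,
        status_forcelist=shared.STATUS_FORCELIST,
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session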

# author_counts: {license: {author_count(int|None): count}}

# Save license counts
with open(FILE_ARXIV_COUNT, "w", newline="") as fh:
@Babi-B commented:

encoding="utf-8" here too

    writer.writerow({"TOOL_IDENTIFIER": lic, "COUNT": c})

# Save detailed category counts (code)
with open(FILE_ARXIV_CATEGORY, "w", newline="") as fh:
@Babi-B commented:

encoding="utf-8"

@Babi-B commented:

@Goziee-git

I noticed you didn't add encoding="utf-8" to your open() calls when writing to CSVs. I didn't comment on all of them, but it is important to do so to avoid potential encoding issues on platforms where UTF-8 might not be the default.
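
For example (a sketch using FILE_ARXIV_COUNT and the headers from the snippets above; the row is illustrative):

import csv

with open(FILE_ARXIV_COUNT, "w", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, fieldnames=["TOOL_IDENTIFIER", "COUNT"])
    writer.writeheader()
    writer.writerow({"TOOL_IDENTIFIER": "CC BY 4.0", "COUNT": 42})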

@Babi-B commented Oct 24, 2025

Hi @TimidRobot @Goziee-git

I have noticed entries like these in the CSV files:

"TOOL_IDENTIFIER","AUTHOR_COUNT","COUNT"
"UNKNOWN CC legal tool","3","12"
"UNKNOWN CC legal tool","2","18"
"UNKNOWN CC legal tool","1","9"
"UNKNOWN CC legal tool","8","4"
"UNKNOWN CC legal tool","4","13"
"UNKNOWN CC legal tool","10","5"
"UNKNOWN CC legal tool","5","9"
"UNKNOWN CC legal tool","9","4"

I was wondering about the purpose of collecting UNKNOWN CC legal tool values. Will they be used to diagnose data quality (like a plot of unknown vs. known license coverage), or should they be excluded from the analysis phase?

@TimidRobot (Member) commented:

Without feedparser, the script can't process arXiv's API responses, so I think you have to add the feedparser dependency to your Pipfile to run the script.

@Goziee-git no, you need to include an updated Pipfile and Pipfile.lock in your pull request.

@TimidRobot, since this will require that I add a new dependency to the Pipfile and the Pipfile.lock, would you prefer that I submit the updates for this here without raising an issue for it, or otherwise?

You can include Pipfile and Pipfile.lock in this PR (no new issue necessary). Please be sure changes are limited to what is absolutely necessary.

@TimidRobot (Member) commented:

Hi @TimidRobot @Goziee-git

I have noticed entries like these in the CSV files:

"TOOL_IDENTIFIER","AUTHOR_COUNT","COUNT"
"UNKNOWN CC legal tool","3","12"
"UNKNOWN CC legal tool","2","18"
"UNKNOWN CC legal tool","1","9"
"UNKNOWN CC legal tool","8","4"
"UNKNOWN CC legal tool","4","13"
"UNKNOWN CC legal tool","10","5"
"UNKNOWN CC legal tool","5","9"
"UNKNOWN CC legal tool","9","4"

I was wondering about the purpose of collecting UNKNOWN CC legal tool values. Will they be used to diagnose data quality (like a plot of unknown vs. known license coverage), or should they be excluded from the analysis phase?

I think it's helpful to include the unknowns, but I expect a single entry.
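
A sketch of what a single entry could look like (row values taken from the sample above; this is illustrative, not the script's code):

from collections import defaultdict

rows = [
    ("UNKNOWN CC legal tool", 3, 12),
    ("UNKNOWN CC legal tool", 2, 18),
    ("UNKNOWN CC legal tool", 1, 9),
]

totals = defaultdict(int)
for tool, _author_count, count in rows:
    totals[tool] += count

print(dict(totals))  # {'UNKNOWN CC legal tool': 39}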

@Goziee-git (Author) commented:

Without feedparser, the script can't process arXiv's API responses, so I think you have to add the feedparser dependency to your Pipfile to run the script.

@Goziee-git no, you need to include an updated Pipfile and Pipfile.lock in your pull request.

@TimidRobot, since this will require that I add a new dependency to the Pipfile and the Pipfile.lock, would you prefer that I submit the updates for this here without raising an issue for it, or otherwise?

You can include Pipfile and Pipfile.lock in this PR (no new issue necessary). Please be sure changes are limited to what is absolutely necessary.

Hello @TimidRobot, in addition to feedparser, I also have a new dependency here: PyYAML. The PyYAML dependency is used by arxiv_fetch.py and arxiv_category_converter.py. Both scripts use import yaml, which comes from the PyYAML package, for:
• Loading category mappings (arxiv_category_map.yaml)
• Writing provenance data for audit trails
• Reading configuration files
I am making a note of this here because I am also going to add it to the Pipfile and the Pipfile.lock, which will be included in this PR.

In contrast, I also observed that the gcs_country_collection.yaml and gcs_language_collection.yaml files don't require PyYAML for their generation. However, if any Python scripts read these YAML files later, they would need PyYAML. So, with your permission, can I also add these here, or would you prefer otherwise?
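
For clarity, the Pipfile addition I have in mind looks roughly like this (a sketch; exact version pins may differ):

[packages]
feedparser = "*"
pyyaml = "*"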


Labels: None yet
Project status: Done
Successfully merging this pull request may close: Integrate arXiv as data source for academic commons quantification
3 participants