Add arXiv data fetching functionality #179
Conversation
This is a great start.
I recommend also developing a data/report plan. For example:
- It is not meaningful to get a count of a single language (though it is worth noting that other languages are not available).
- Category codes converted to reporting (words and/or abbreviations instead of acronyms)
Hello @TimidRobot, as requested I have made changes based on your review. I know good work is emphasized over speed, and I hope my attempt to come full circle with the other PR hasn't dented my chances of contributing significantly to the project. Thank you 🙏🏼
@Goziee-git ok, please focus on this PR
@TimidRobot I have removed the query for languages as it returns only English. Also worth noting: the arXiv data source accepts papers in other languages but requires that the abstracts be submitted in English, so it is impossible to get a meaningful distribution of licenses by language. As suggested, I also converted the category codes to reporting words that are more user-friendly and readable, using an external …
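For illustration, the category-code-to-report-word conversion could look roughly like this (a sketch only; it assumes a YAML mapping file such as the arxiv_category_map.yaml referenced later in this review, and the function names here are hypothetical):

import yaml  # assumes PyYAML is available


def load_category_map(path):
    # Returns e.g. {"cs.AI": "Artificial Intelligence", "math.CO": "Combinatorics"}
    with open(path, "r", encoding="utf-8") as file_obj:
        return yaml.safe_load(file_obj) or {}


def category_to_report_word(code, category_map):
    # Fall back to the raw category code when no mapping entry exists
    return category_map.get(code, code)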
Hello @TimidRobot, I observed from results fetched previously that the script failed to detect CC licenses recorded as hyphenated variants (CC-BY, CC-BY-NC, etc.). I have implemented a compiled regex pattern that replaces the string matching for more robust license detection, and I have also looked at implementations in other PRs that use the normalize_license_text() function for consistent license identification. I'd like to know your thoughts on these changes so I can keep improving them. Thanks.
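For reference, hyphen-tolerant detection along the lines described above might look like the following (a sketch under assumptions; the actual pattern and normalize_license_text() body in the PR may differ):

import re

# Matches "CC BY", "CC-BY-NC", "cc by-nc-sa", etc., regardless of hyphenation
CC_LICENSE_PATTERN = re.compile(
    r"\bCC[-\s]?(BY(?:[-\s]?(?:NC|ND|SA))*)\b", re.IGNORECASE
)


def normalize_license_text(raw_text):
    # Return a normalized identifier such as "CC BY-NC", or "unknown"
    if not raw_text:
        return "unknown"
    match = CC_LICENSE_PATTERN.search(str(raw_text))
    if match is None:
        return "unknown"
    # Collapse hyphens/whitespace so "CC-BY-NC" and "CC BY NC" normalize alike
    return "CC " + re.sub(r"[\s-]+", "-", match.group(1).upper())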
I'll look at data after outstanding comments are resolved.
That said, I'm not excited about adding JSON to the project.
It's probably a good idea to create a function in the shared library eventually. Please leave that to last, however.
Refactor arxiv_fetch.py to use requests library for HTTP requests, implementing retry logic for better error handling. Update license extraction logic and CSV headers to remove PLAN_INDEX.
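For context, the retry-enabled session described in this change would look roughly like the following (sketched from the diff fragments shown below; the helper name is assumed):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def get_requests_session():
    # Retry transient HTTP failures with exponential backoff
    retry_strategy = Retry(
        total=5,
        backoff_factor=1,
        status_forcelist=[408, 429, 500, 502, 503, 504],
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry_strategy))
    return session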
I'm unable to test the script.
Command:
Output:
Traceback (most recent call last):
File "/Users/timidrobot/git/creativecommons/quantifying/./scripts/1-fetch/arxiv_fetch.py", line 18, in <module>
import feedparser
ModuleNotFoundError: No module named 'feedparser'
scripts/1-fetch/arxiv_fetch.py (outdated)
    return session


def normalize_license_text(raw_text: str) -> str:
Why does this function have types?
@TimidRobot, the type hint in normalize_license_text() helps ensure consistency in the expected data type, since entries from the arXiv API can be returned in several data types; this helps with processing and avoiding type errors. I also figured that normalizing the returned value to a string would prevent type-related bugs when the license identifier is used in dictionaries, CSV writing, or logging. Would you prefer a different approach or implementation here?
@Goziee-git I meant, "Why does this function alone have types?". It looks like you copied something without understanding it (you should never submit code you don't understand).
Type checking in Python still requires supporting tooling. For example:
File: test.py
def test_type(test: str) -> str:
    print(type(test))

test_type(1)
Command:
python3 ./test.py
Output:
<class 'int'>
Any new tooling should first be decided in an issue and then added in a dedicated pull request.
@TimidRobot, I have raised issue #212 for the type-check tooling, specifically proposing the mypy Python package, and I have read through its usage. The current implementation here only adds type hints with no enforcement. I'd appreciate it if you could review the issue and share your recommendations, since the hints here only become meaningful once issue #212 is resolved.
Please remove the type hint. It can be added back when/if type hints are supported by the project.
@TimidRobot, I ran the project as directed by the project README, which requires installing the dependencies first and then running the script from the root of the project using … Without feedparser, the script can't process arXiv's API responses, so I think you have to add the …
@Goziee-git no, you need to include an updated …
@TimidRobot, since this will require that I add a new dependency/tooling to the …
d4409a4 to 6769a33
FILE_ARXIV_YEAR = shared.path_join(
    PATHS["data_1-fetch"], "arxiv_3_count_by_year.csv"
)
FILE_ARXIV_AUTHOR = shared.path_join(
    PATHS["data_1-fetch"], "arxiv_4_count_by_author_count.csv"
)
FILE_ARXIV_AUTHOR_BUCKET = shared.path_join(
    PATHS["data_1-fetch"], "arxiv_4_count_by_author_bucket.csv"
Hi @Goziee-git!
I think you should order your constants
Hi @TimidRobot!
Given how large the constant list is, would you recommend using a dictionary here?
For example (R43-R49),
FILE_ARXIV = {
"count": shared.path_join(PATHS["data_1-fetch"], "arxiv_1_count.csv"),
"category": shared.path_join(PATHS["data_1-fetch"], "arxiv_2_count_by_category.csv"),
"category_report": shared.path_join(PATHS["data_1-fetch"], "arxiv_2_count_by_category_report.csv"),
...
}
All related constants would be grouped in the same dictionary.
The benefit of a dictionary is that it can be acted on programmatically. If that is helpful, a dictionary is a good idea. If not, I tend to prefer individual constants.
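For example, with the grouped paths the per-file setup could be driven by a loop (a sketch; HEADERS_BY_KEY is a hypothetical companion mapping, and initialize_data_file() is the helper from this PR):

for key, file_path in FILE_ARXIV.items():
    initialize_data_file(file_path, HEADERS_BY_KEY[key])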
Really, it's good practice to order constants, and I do have a lot of them. I agree with you that ordering them keeps the flow organized and conforms to proper coding practice. This script is still in development and will need all the insights I can get to make it best for long-term use. I appreciate this, @Babi-B, thanks!
    except Exception:
        return {}
Hi @Goziee-git!
I have a question about this. Why did you choose to silently swallow the exceptions? Would it not be better to log an error/warning so the user knows the mapping wasn't loaded?
This script is still in development; thanks @Babi-B for the notes here. I have looked into it carefully and implemented changes to properly log exceptions instead of silently swallowing them.
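For reference, the logged-exception variant might look something like this (a sketch; map_path, the YAML loading, and the module-level LOGGER are assumed from the surrounding context):

try:
    with open(map_path, "r", encoding="utf-8") as file_obj:
        category_map = yaml.safe_load(file_obj) or {}
except (OSError, yaml.YAMLError) as err:
    LOGGER.error("Failed to load category map %s: %s", map_path, err)
    category_map = {}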
dev/arxiv_category_converter.py (outdated)
@@ -0,0 +1,73 @@
#!/usr/bin/env python
Is this script meant to be imported purely as a module? If it is, I think it is not needed here
I mean the shebang
        data_dir: Directory containing arxiv_category_map.yaml
    """
    if not os.path.exists(input_file):
        return
Here too...a silent return. Why not a warning or an exception? Could this not lead to confusing downstream errors?
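One way to surface the problem while keeping the early return (a sketch, assuming the module-level LOGGER used elsewhere in the script):

if not os.path.exists(input_file):
    LOGGER.warning("Input file not found, skipping conversion: %s", input_file)
    return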
    with (
        open(input_file, "r") as infile,
        open(output_file, "w", newline="") as outfile,
I think you should add encoding="utf-8" to prevent platform-dependent encoding issues.
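The same context managers with the suggested encoding argument would read (sketch):

with (
    open(input_file, "r", encoding="utf-8") as infile,
    open(output_file, "w", newline="", encoding="utf-8") as outfile,
):
    ...  # existing conversion logic unchanged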
]

# Log the start of the script execution
LOGGER.info("Script execution started.")
You should put this in the main function so it triggers when the script actually runs
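Moving the start-up log into the entry point, as suggested, might look like this (sketch):

def main():
    LOGGER.info("Script execution started.")
    # ... fetch, aggregate, and write the CSV files ...


if __name__ == "__main__":
    main()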
scripts/1-fetch/arxiv_fetch.py (outdated)
def initialize_data_file(file_path, headers):
    """Initialize CSV file with headers if it doesn't exist."""
    if not os.path.isfile(file_path):
        with open(file_path, "w", newline="") as file_obj:
I think you should add encoding="utf-8" here too
scripts/1-fetch/arxiv_fetch.py (outdated)
    retry_strategy = Retry(
        total=5,
        backoff_factor=1,
        status_forcelist=[408, 429, 500, 502, 503, 504],
There is a STATUS_FORCELIST constant in shared.py.
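Using the shared constant instead of a local list would be a small swap (sketch; assumes the script already imports shared, as the constants above do):

retry_strategy = Retry(
    total=5,
    backoff_factor=1,
    status_forcelist=shared.STATUS_FORCELIST,
)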
scripts/1-fetch/arxiv_fetch.py (outdated)
    # author_counts: {license: {author_count(int|None): count}}

    # Save license counts
    with open(FILE_ARXIV_COUNT, "w", newline="") as fh:
encoding="utf-8" here too
scripts/1-fetch/arxiv_fetch.py (outdated)
        writer.writerow({"TOOL_IDENTIFIER": lic, "COUNT": c})

    # Save detailed category counts (code)
    with open(FILE_ARXIV_CATEGORY, "w", newline="") as fh:
encoding="utf-8"
I noticed you didn't add encoding="utf-8" to your open() calls when writing to CSVs. I didn't comment on all, but it is important to do so to avoid potential encoding issues where UTF-8 might not be the default
I have noticed entries like these in the CSV files: … I was wondering about the purpose of collecting …
@Goziee-git no, you need to include an updated …
You can include …
I think it's helpful to include the unknowns, but I expect a single entry.
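One way to keep a single unknown row is to funnel every raw value through the normalizer before counting (a sketch; fetched_license_values is a hypothetical name for the raw strings pulled from the API):

from collections import Counter

license_counts = Counter(
    normalize_license_text(value) for value in fetched_license_values
)
# normalize_license_text() returns "unknown" for anything it cannot match,
# so all unrecognized variants collapse into one "unknown" entry in the CSV.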
Hello @TimidRobot, in addition to the … In contrast, I also observed that the …
…ared STATUS_FORCELIST
- Removed type hints from normalize_license_text function
- Type checkers not yet used as tooling for the project
Fixes
Description
Implements a comprehensive arXiv data collection system to quantify open-access academic papers in the commons.
Type of Change
Changes Made
scripts/1-fetch/arxiv_fetch.py
Testing
./dev/check.sh)
Data Impact
Related Documentation
sources.md with arXiv API credentials setup
Checklist
… index.md)
… main or master)
… visible errors.
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."