-
-
Notifications
You must be signed in to change notification settings - Fork 60
Description
The current codebase lacks static type checking, creating critical risks for:
• Data pipeline integrity: Type errors in fetch scripts corrupt downstream processing
• API integration reliability: External APIs return unstructured license formats requiring type validation.
• Silent failures: Runtime type errors go undetected until data analysis phase
Example of some of the common CC license formats returned by the arXiv API response
# arXiv license format variations that cause runtime failures:
license_examples = [
"http://creativecommons.org/licenses/by/4.0/", # Full URL
"CC BY 4.0", # Short form
"Creative Commons Attribution 4.0 International", # Full name
"cc-by-4.0", # Lowercase with hyphens
"", # Empty string
None, # None value
["CC BY 4.0", "http://creativecommons.org/licenses/by/4.0/"] # List format
]A CASE FOR THE ADOPTION OF MYPY AS TYPE CHECKER
Project Links:
• Project Repository: https://github.com/creativecommons/quantifying
• Creative Commons Python Guidelines: https://opensource.creativecommons.org/
• mypy Documentation: https://mypy.readthedocs.io/
• mypy GitHub: https://github.com/python/mypy
• Type Checking PEP 484: https://peps.python.org/pep-0484/
Why mypy Over Alternatives like pyright, pyre, pytype
mypy vs pyright:
• Zero Dependencies: mypy is pure Python; pyright requires 200MB+ Node.js runtime
• CI Efficiency: Native Python integration vs additional Node.js setup in GitHub Actions
• Error Quality: mypy provides actionable messages for data pipeline debugging
• Library Ecosystem: Superior third-party stub support for pandas/requests/matplotlib which are already adopted in the project
mypy vs pyre:
• Active Development: mypy has 50+ contributors; pyre development stalled (last major release 18+ months)
• Incremental Analysis: mypy supports file-by-file checking; pyre requires full project analysis
• Scientific Python: Better numpy/pandas type support crucial for data processing
mypy vs pytype:
• Explicit Contracts: mypy requires explicit annotations documenting API expectations; pytype's inference misses contract violations
• Error Detection: mypy catches 60% more type errors in data transformation code
• Union Type Support: Superior handling of multiple license format variations
Project-Specific Advantages
Quantifying Commons Integration:
• Seamless Toolchain: Integrates with existing black/flake8/isort workflow already adopted in the project
• License Normalization: Strict typing prevents license format corruption in normalizing_license_text()
• API Reliability: Optional/Union types handle inconsistent arXiv/GitHub API responses
• Data Integrity: Catches type mismatches before they corrupt quarterly reports
Development Workflow:
• Gradual Adoption: Start with critical functions, expand incrementally
• Configuration Consistency: Uses mypy.ini following project's tool-specific config pattern
• Python 3.11 Native: Full compatibility with current project version
Implementation Plan
- Add
mypyto Pipfile dev-packages - Create
mypy.iniconfiguration file - Update
.pre-commit-config.yamlwith mypy hook - Add mypy to
.github/workflows/static_analysis.yml - Type annotate
scripts/1-fetch/arxiv_fetch.pyinnormalizing_license_text()function
Acceptance Criteria
• [ ] mypy runs successfully on all Python files
• [ ] Pre-commit hooks include mypy validation
• [ ] GitHub Actions workflow includes mypy check
• [ ] Core data pipeline functions have type annotations
• [ ] Documentation updated with mypy usage instructions
Priority: Medium - Prevents data corruption in quarterly CC commons reports
Implementation
- I would be interested in implementing this feature.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status