OpenStreetMap Integration #236
base: master
Conversation
…gaete/powerplantmatching into feature_osm_perf_improvements
OSM Integration Example

I've added an example script demonstrating the new OSM module functionality.

Example Script:

```python
import os

import matplotlib.pyplot as plt

# Import paths are assumptions: OSM and GEM are assumed to live in
# powerplantmatching.data and the plotting helpers in powerplantmatching.plot.
from powerplantmatching import get_config
from powerplantmatching.data import GEM, OSM
from powerplantmatching.plot import (
    fueltype_and_country_totals_bar,
    fueltype_totals_bar,
)

# Main execution
if __name__ == "__main__":
    # Set up output directory
    output_dir = "outputs"
    os.makedirs(output_dir, exist_ok=True)

    # List of countries to process
    countries = [
        "Chile",
        "South Africa",
        "Indonesia",
    ]

    # Get the base configuration
    config = get_config()
    config["main_query"] = ""
    config["target_countries"] = countries
    config["OSM"]["force_refresh"] = False
    config["OSM"]["plants_only"] = False
    config["OSM"]["units_clustering"]["enabled"] = False
    config["OSM"]["missing_name_allowed"] = False
    config["OSM"]["missing_technology_allowed"] = False
    config["OSM"]["missing_start_date_allowed"] = True

    # Get combined data for all countries
    osm_data = OSM(raw=False, update=False, config=config)
    gem_data = GEM(raw=False, update=False, config=config)

    fig, axis = fueltype_totals_bar([gem_data, osm_data], keys=["GEM", "OSM"])
    plt.savefig(os.path.join(output_dir, "osm_gem_ppm.png"))

    fig, axis = fueltype_and_country_totals_bar(
        [gem_data, osm_data], keys=["GEM", "OSM"]
    )
    plt.savefig(os.path.join(output_dir, "osm_gem_ppm_country.png"))
```

Results:

The example generates two plots comparing OSM and GEM data. These results show that OSM provides data quality comparable to specialized sources like GEM for major generation types (Hard Coal, Hydro, Natural Gas), with particularly strong alignment in South Africa. The OSM module successfully captures most major power plant types while also providing additional data for some categories (such as Oil and Waste) that may be missing in other sources. The differences highlight opportunities for enhancement, particularly for Wind and Geothermal power plants, which show lower coverage in OSM. This will be covered in the coming tasks, where the focus is data validation and code revisions towards data quality enhancement.
Next Steps for OSM Implementation

Data Validation Across Multiple Countries: the next phase will focus on thorough data validation.

Documentation and Tutorials: to support the implementation, development of supporting materials will include:
- Document comprehensive OpenStreetMap data source with advanced processing
- Highlight dual purpose for energy analysts and OSM community
- Include details on caching, enhancement features, and quality control
…en-energy-transition/powerplantmatching into feature_osm_perf_improvements
@cdgaete I have tested your example and it all works impressively well. Would you mind fixing the remaining failing tests (it is probably just fixing a deprecation; we can also drop support for Python 3.9 if needed)? Then we can finally merge.
Force-pushed 4f9a6f5 to 614713e (compare)
Breaking changes:
- Minimum Python version is now 3.10 (was 3.9)
- Remove mypy and all type checking infrastructure

Changes:
- Update requires-python to >=3.10 in pyproject.toml
- Remove Python 3.9 from classifiers
- Delete type-checking.yml GitHub workflow
- Remove mypy and type stub dependencies (mypy, types-requests, types-PyYAML, pandas-stubs, types-tqdm, types-six)
- Remove [tool.mypy] configuration section from pyproject.toml
- Update release notes to document breaking changes
- Update isinstance calls to use Python 3.10+ union syntax (X | Y)
Force-pushed 66327d7 to 749e3fc (compare)
- Replace Optional[X] with X | None throughout OSM module
- Update isinstance calls to use X | Y syntax
- Apply ruff formatting for consistency with Python 3.10+ standards
Yes, I would like to have a look, too. Before merging, we need to understand what impact it has on the default `powerplants.csv`.
Pull Request Overview
This pull request adds comprehensive OpenStreetMap (OSM) integration capabilities to powerplantmatching, enabling extraction of power plant data from the global OpenStreetMap database. The implementation provides a robust data processing pipeline with caching, quality tracking, and analysis tools.
- Implements a complete OSM data extraction pipeline with hierarchical processing (relations → ways → nodes)
- Adds sophisticated multi-level caching system to minimize API calls and improve performance
- Introduces capacity estimation, plant reconstruction, and rejection tracking for data quality improvement
Reviewed Changes
Copilot reviewed 43 out of 45 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pyrightconfig.json | Adds Pyright type checker configuration for basic type checking |
| pyproject.toml | Updates Python version requirement to 3.10+, adds new dependencies (scikit-learn, shapely), removes mypy configuration |
| powerplantmatching/plot.py | Removes large commented code blocks, adds new plotly_map function for interactive mapping |
| powerplantmatching/package_data/config.yaml | Adds comprehensive OSM configuration section with API settings, source mappings, and processing parameters |
| powerplantmatching/osm/workflow.py | Implements main workflow orchestration for OSM data processing pipeline |
| powerplantmatching/osm/utils.py | Provides utility functions for capacity parsing, country validation, and cache path management |
| powerplantmatching/osm/retrieval/regional.py | Implements regional download functionality for custom geographic areas |
| powerplantmatching/osm/retrieval/populate.py | Provides cache population utilities for batch processing multiple countries |
| powerplantmatching/osm/retrieval/client.py | Core Overpass API client with retry logic, caching, and progress tracking |
| powerplantmatching/osm/retrieval/cache.py | Multi-level caching system for OSM elements and processed units |
| powerplantmatching/osm/retrieval/__init__.py | Package initialization for retrieval module |
| powerplantmatching/osm/quality/rejection.py | Comprehensive rejection tracking system for data quality analysis |
| powerplantmatching/osm/quality/coverage.py | Cache coverage analysis and maintenance tools |
| powerplantmatching/osm/quality/__init__.py | Package initialization for quality module |
| powerplantmatching/osm/parsing/plants.py | Parser for OSM power plant relations with reconstruction capabilities |
Comments suppressed due to low confidence (2)
pyproject.toml:13
- Removing Python 3.9 support is a breaking change. Consider documenting this in release notes or maintaining backward compatibility if existing users depend on Python 3.9.
license = { file = "LICENSE" }
powerplantmatching/osm/parsing/plants.py:434
- Using `assert` for runtime validation in production code is problematic. The assertion could be disabled with `python -O`. Consider raising a proper exception like `ValueError` or `RuntimeError` instead.
mismatch_ratio = (
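The suggested change can be sketched as follows; `check_mismatch`, its arguments, and the threshold are hypothetical names for illustration, not the PR's actual code.

```python
def check_mismatch(matched: int, total: int, threshold: float = 0.5) -> float:
    """Validate with exceptions instead of assert, so checks survive `python -O`."""
    if total <= 0:
        raise ValueError(f"total must be positive, got {total}")
    mismatch_ratio = (total - matched) / total
    if mismatch_ratio > threshold:
        raise RuntimeError(
            f"mismatch ratio {mismatch_ratio:.2f} exceeds threshold {threshold}"
        )
    return mismatch_ratio
```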
```python
        return False, None, "regex_error"
else:
    if regex_patterns is None:
        regex_patterns = [
```
Copilot AI (Jul 31, 2025)
The regex patterns for capacity parsing are hardcoded. Consider moving these to configuration to make them easily customizable without code changes.
I would not move these away, but add a comment what each regex intends to match
```python
while retries < self.max_retries:
    try:
        response = requests.post(
            self.api_url, data={"data": query}, timeout=self.timeout + 30
```
Copilot AI (Jul 31, 2025)
The timeout value is calculated as self.timeout + 30 but this could result in very long timeouts if self.timeout is large. Consider setting a maximum timeout limit to prevent indefinite waits.
```diff
- self.api_url, data={"data": query}, timeout=self.timeout + 30
+ self.api_url, data={"data": query}, timeout=min(self.timeout + 30, MAX_TIMEOUT)
```
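A tiny sketch of the capped timeout, with `MAX_TIMEOUT` as a hypothetical constant (the suggestion does not specify a value):

```python
MAX_TIMEOUT = 300  # seconds; hypothetical upper bound


def effective_timeout(base: int, buffer: int = 30) -> int:
    """Cap the request timeout so a large base value cannot wait indefinitely."""
    return min(base + buffer, MAX_TIMEOUT)
```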
```python
def _save_cache(self, cache_path: str, data: dict) -> None:
    """Save dictionary to JSON cache file."""
    cache_data = data if data else {}
```
Copilot AI (Jul 31, 2025)
[nitpick] The condition `data if data else {}` can be simplified to `data or {}` for better readability.
```diff
- cache_data = data if data else {}
+ cache_data = data or {}
```
```python
rejections_to_process.extend(rejections)

for rejection in rejections_to_process:
    if rejection.coordinates is None or "cluster" in rejection.id.lower():
```
Copilot AI (Jul 31, 2025)
Using string matching with "cluster" in rejection.id.lower() is fragile. Consider adding a proper flag or enum value to identify cluster rejections more reliably.
It does not have an effect on the default powerplants.csv, as it is not added to the default matching sources. So at this stage it is purely about adding an optional source. However, @cdgaete, would you mind making a short comparison of the powerplants.csv with and without OSM (preferably for Europe)?
maurerle
left a comment
This is huge.
It increases the code base by a multiple of its current size and is hard to review.
Files which are not essential to the features of this PR are changed here as well.
I made some code remarks for better readability.
Recommendations to split functions or reduce the scope of classes (by not including the config everywhere).
.pre-commit-config.yaml (outdated)

```yaml
# - repo: https://github.com/RobertCraigie/pyright-python
#   rev: v1.1.383
#   hooks:
#     - id: pyright
```
I would remove this unrelated change
unrelated change
```python
# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
```
instead of full removal, maybe add SPDX-Headers?
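For example, an SPDX header could look like the sketch below; the exact identifier depends on the project's license terms, and `GPL-3.0-or-later` is an assumption based on the GNU header being removed here.

```python
# SPDX-FileCopyrightText: Contributors to powerplantmatching
#
# SPDX-License-Identifier: GPL-3.0-or-later
```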
```python
logger.info(f"Total countries to process: {len(all_countries)}")

if dry_run:
    print("\nDry run - would download the following countries:")
```
Use logger instead of printing?
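A minimal sketch of the suggested change (function and logger names are illustrative):

```python
import logging

logger = logging.getLogger("powerplantmatching.osm.populate")  # illustrative name


def announce_dry_run(countries: list[str]) -> None:
    # logger.info instead of print, so output respects the logging configuration
    logger.info("Dry run - would download the following countries:")
    for name in countries:
        logger.info("  %s", name)
```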
```python
total_elements = total_plants + total_generators

if check_live_counts:
```
maybe extract this branch into a separate function which updates the passed cached_countries?
```python
)
cached_countries[country_code]["cache_status"] = "error"

if return_data:
```
instead of spending another 100 lines of code in this function, I would rather create a function which works on the output dict and just handles the printing.
That way, we can spare the return_data parameter as well and receive cleaner code.
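One way to sketch that split, with illustrative names: one function builds the summary dict, another only handles the printing.

```python
def summarize_coverage(cached_countries: dict[str, dict]) -> dict[str, int]:
    """Count countries per cache_status."""
    summary: dict[str, int] = {}
    for info in cached_countries.values():
        status = info.get("cache_status", "unknown")
        summary[status] = summary.get(status, 0) + 1
    return summary


def print_coverage(summary: dict[str, int]) -> None:
    """Presentation only; callers that want the data use summarize_coverage."""
    for status, count in sorted(summary.items()):
        print(f"{status}: {count}")
```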
Looks good.
```python
    Minimum similarity ratio for common substrings
    """

def __init__(self, config: dict[str, Any]):
```
This class only needs `name_similarity_threshold` - why give it the whole config?
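The suggested narrowing could be sketched like this: the class takes only the value it uses, with an optional convenience constructor for callers holding a config dict. The class name and config key are illustrative, not the PR's actual code.

```python
class NameMatcher:
    """Illustrative class that depends on one threshold, not the full config."""

    def __init__(self, name_similarity_threshold: float):
        self.name_similarity_threshold = name_similarity_threshold

    @classmethod
    def from_config(cls, config: dict) -> "NameMatcher":
        # Convenience constructor keeps config knowledge in one place
        return cls(config["OSM"]["name_similarity_threshold"])
```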
```python
def __init__(
    self,
    config: dict[str, Any],
```
This class only needs `min_generators_for_reconstruction` - why give it the whole config?
…fig management

- Refactor classes to accept specific parameters instead of full config objects
- Split large functions in coverage.py into focused, single-purpose functions
- Replace print statements with proper logging throughout codebase
- Add SPDX license headers to all new files
- Replace runtime assertions with proper exception handling
- Enhance caching with diskcache for better memory management
- Add comprehensive European power plant analysis (39K plants, 850+ GW OSM contribution)
- Improve error handling and CSV corruption recovery
- Add omitted countries support for API limitations
Response to Review Feedback

Thank you @maurerle, @FabianHofmann, @fneum, and @lkstrp for the thorough reviews and constructive feedback. I've made substantial improvements to address the core concerns raised about code organization, configuration management, and maintainability. Disclaimer: the huge jump in lines of code is associated with an HTML report in the documentation.

@maurerle's Review Comments

- Configuration Dependencies Reduction:
- Function Decomposition:
- Logging Instead of Print Statements:
- License Headers:
- Unrelated Changes:

@copilot Comments

- Runtime Assertions:
- Breaking Changes Documentation:
- Code Simplification:
- Regex Pattern Documentation:

Additional Architectural Improvements

- Enhanced Caching System:
- Improved Error Handling:
- Configuration Management:
- Documentation and European Power Plant Analysis (addressing @FabianHofmann and @fneum's request for understanding OSM impact):

Comments Evaluated but Not Implemented

- Timeout Management (Copilot AI suggestion):
- String Matching for Rejection Types (Copilot AI suggestion):

Viewing the EU Analysis Report

The comprehensive European analysis is available in multiple formats:

- OSM_Data_Coverage_Analysis_documentation.pdf
- Interactive Map: clone the branch and open …
- Static Analysis: the downloadable script …
- Documentation: see …

Please let me know if any specific areas need further clarification or if you'd like me to address any of the remaining optimization suggestions.
@cdgaete could you use scatter clustering for easier viewing of your map?
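One lightweight way to thin out map markers is grid-based aggregation before plotting; the sketch below is an illustration of that idea, not the PR's actual `plotly_map` implementation, and the cell size in degrees is an assumption.

```python
from collections import defaultdict


def cluster_points(
    points: list[tuple[float, float]], cell: float = 1.0
) -> list[tuple[float, float, int]]:
    """Group (lat, lon) points into grid cells; return (lat, lon, count) per cell."""
    cells: dict[tuple[int, int], list[tuple[float, float]]] = defaultdict(list)
    for lat, lon in points:
        # Bucket each point by its grid cell
        cells[(int(lat // cell), int(lon // cell))].append((lat, lon))
    # One centroid marker per cell, sized by the number of points it represents
    return [
        (
            sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts),
            len(pts),
        )
        for pts in cells.values()
    ]
```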
I don't think I will have enough time. I am fine as long as it's not in the default matching sources, but generally share the sentiment of @maurerle that this PR is basically a package on its own (given its size) and impossible to review (and possibly maintain) in detail. What I liked about powerplantmatching so far is that it took in some pre-processed data without the need to connect to APIs etc. I think it's worth considering the option to outsource this to another repository and then work with a pre-processed extract of OSM data. Keeps the scope of powerplantmatching perhaps a bit smaller.
lkstrp
left a comment
What I liked about powerplantmatching so far is that it took in some pre-processed data without the need to connect to APIs etc. I think it's worth considering the option to outsource this to another repository and then work with a pre-processed extract of OSM data. Keeps the scope of powerplantmatching perhaps a bit smaller.
I agree, provided this remains as bloated as it is. @cdgaete, I'm afraid the code and comments in this PR are mostly copied and pasted agent output. They're very verbose, contain a lot of meaningless stuff and could be summarised to 1/10th of length. I always expected this to get cleaned up further. The feature might be cool, but I don't see it anywhere near getting merged in ppm.
If this gets cleaned up, I can add a lot more review comments (next to those of @maurerle, which haven't been resolved yet), and many of those are obvious. But this will need more work. Otherwise we can bring this to another repo and import just the dataset into ppm, as proposed by fneum.
Thanks for the detailed feedback. You're right: the OSM implementation has grown into a package of its own. What started as a straightforward data source integration revealed OSM's complexity: community-built data with numerous edge cases requiring individual handling to extract reliable power plant information.

Regarding the code: while I used AI assistance for code and documentation, I'm the one architecting and refining every implementation detail. The scope is genuinely large: ~8k lines for the OSM core logic, ~9k for the EU visualization, plus documentation. This isn't verbosity; it reflects the extensive feature set needed to handle OSM's data quality challenges. The OSM module includes enhancement features (cache validation, regional coverage report) that could be removed for the core integration, but might be valuable if we spin this into a separate package.

@lkstrp I carefully addressed maurerle's initial review; I believe all comments were considered and responded to. If there are specific unresolved items, please point me to them so I can check what's missing.

Given the feedback from @fneum and @lkstrp converging on the same point, I agree the cleanest path forward is a separate repository. This would let us maintain OSM's complexity independently while providing powerplantmatching with pre-processed, clean datasets. If we opt for this option, I can put a Zenodo link to the resulting OSM datasets and implement here in PPM a script that downloads and integrates them into the pipeline, as the other datasets already do.
That option sounds great! Thanks @cdgaete, and sorry that we've been so critical!


Closes #12
Supersedes #225
Changes proposed in this Pull Request
Current Implementation Focus
The OSM module has been designed with several key architectural components:
Comprehensive Data Processing Pipeline: A structured workflow that processes OSM elements based on hierarchy (relations, ways, nodes), with specialized handling for each element type
Advanced Geometric Analysis: Utilizes spatial relationships to identify overlapping power plants and prevent duplication
Multi-level Data Extraction: Implements a cascading approach to extract capacity information:
Efficient Caching Architecture: Implemented a multi-tiered caching system that significantly reduces API requests and improves performance, enabling quick analysis of large geographic areas
Rejection Tracking System: The rejection tracker serves two critical purposes:
Benefits
Checklist