This application imports large geolocation datasets from a CSV file into a PostgreSQL database and provides an API to query the geolocation data by IP address.
- Chunk-Based CSV Import: Handles large datasets efficiently by processing data in chunks (sketched after this list).
- Validation: Ensures only valid rows are imported, skipping duplicates and invalid entries.
- Deduplication: Removes duplicate rows based on the IP address.
- PostgreSQL: Uses a production-grade database.
- API: Provides an endpoint to query geolocation data by IP address.
- Containerization: Uses Docker and Docker Compose for easy setup and deployment.
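The chunk-based import, validation, and deduplication described above could look roughly like the sketch below. This is a minimal illustration assuming pandas and SQLAlchemy; the table name, connection string, and required-column list are hypothetical rather than the project's actual identifiers (the columns are taken from the example API response further below).

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; the real one comes from .env via app/config.py.
engine = create_engine("postgresql://user:password@db:5432/geo")

# Columns assumed from the example API response shown later in this README.
REQUIRED_COLUMNS = ["ip_address", "country_code", "country", "city", "latitude", "longitude"]

def import_csv(path: str, chunksize: int = 50_000) -> None:
    seen_ips: set[str] = set()  # IPs already written, used for cross-chunk deduplication
    for chunk in pd.read_csv(path, chunksize=chunksize):
        # Validation: drop rows missing any required field.
        chunk = chunk.dropna(subset=REQUIRED_COLUMNS)
        # Deduplication: within the chunk and against earlier chunks, keyed on IP address.
        chunk = chunk.drop_duplicates(subset="ip_address")
        chunk = chunk[~chunk["ip_address"].isin(seen_ips)]
        seen_ips.update(chunk["ip_address"])
        # Bulk-insert the cleaned chunk into PostgreSQL.
        chunk.to_sql("geolocation_records", engine, if_exists="append", index=False)
```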
geo_service/
├── app/
│ ├── __init__.py # Empty file that makes app a package
│ ├── config.py # Application configuration using .env
│ ├── constants.py # Constants for CSV column names
│ ├── database.py # Database setup and connection logic
│ ├── models.py # Database schema for geolocation records
│ ├── main.py # FastAPI app initialization
│ ├── routes.py # API routes definition
├── etl/
│ ├── validators.py # Validation logic for data sanitization
│ ├── pipeline.py # Initiates the ETL process
│ └── data_dump.csv # Given CSV file with geolocation data
├── tests/
│ ├── __init__.py
│ ├── test_data/ # Sample data for testing
│ ├── test_etl.py # Unit tests for ETL services (e.g., data loading)
│ ├── test_database.py # Unit tests for the database layer
│ ├── test_api.py # Unit tests for the API
├── .env # Environment variables for configuration
├── Dockerfile # Dockerfile to containerize the application
├── docker-compose.yml # Compose file to run app and PostgreSQL together
├── requirements.txt # Python dependencies with versions
├── README.md # Documentation
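As an illustration of how app/config.py might load settings from .env, here is a minimal sketch assuming python-dotenv; the variable names and defaults are hypothetical and should mirror whatever example.env actually defines.

```python
import os
from dotenv import load_dotenv

load_dotenv()  # read variables from .env into the process environment

# Hypothetical settings; names and defaults are illustrative only.
DATABASE_URL = os.getenv("DATABASE_URL", "postgresql://user:password@db:5432/geo")
CSV_CHUNK_SIZE = int(os.getenv("CSV_CHUNK_SIZE", "50000"))
```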
- Docker and Docker Compose installed
- Database credentials configured in .env (see example.env for reference)
- Build and start the application:
  docker compose up -d
- Access the application:
  - Swagger API Docs: http://localhost:8000/docs
To import data from a CSV file into the database:
- Trigger the ETL process from the container shell:
docker exec geo-web-app python app/etl/pipeline.py
- Query the geolocation data by IP address:
  GET /geolocation/{ip_address}
- Example Response:
  {
    "ip_address": "85.175.199.150",
    "country_code": "TW",
    "country": "Saudi Arabia",
    "city": "Deckowtown",
    "latitude": 25.6561924125008,
    "longitude": -163.7348649682793
  }
- pgAdmin: http://localhost:5050
To run unit tests:
- Run unit tests with a code coverage report:
docker exec -it geo-web-app pytest --disable-warnings --cov=app tests/
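For reference, a unit test for the API endpoint might look like the sketch below. It assumes the FastAPI instance is importable as app.main:app and that the test database already contains the queried IP; both assumptions are illustrative.

```python
from fastapi.testclient import TestClient

from app.main import app  # assumes the FastAPI instance is named `app` in app/main.py

client = TestClient(app)

def test_get_geolocation_returns_record():
    # Assumes the test database already holds this IP (e.g., loaded from tests/test_data/).
    response = client.get("/geolocation/85.175.199.150")
    assert response.status_code == 200
    body = response.json()
    assert body["ip_address"] == "85.175.199.150"
    assert "country" in body and "latitude" in body

def test_get_geolocation_unknown_ip_returns_404():
    response = client.get("/geolocation/10.0.0.1")  # assumes this IP is absent from the data
    assert response.status_code == 404
```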
- PostgreSQL: Chosen for scalability and production-readiness.
- Pandas: Efficient for large-scale data processing and validation.
- FastAPI: A modern, lightweight framework for building high-performance APIs with automatic interactive documentation.
- Limited scalability:
  - As the dataset grows, this approach will struggle: the single-threaded, chunk-by-chunk process does not fully utilize CPU resources or database concurrency.
- Bottleneck for large datasets:
  - When the data exceeds what pandas can hold in memory, distributed processing can be leveraged instead.
- Temporarily drop indexes, then recreate them after the load.
- Add caching (e.g., LRU or Redis) for frequent queries.
- Use multithreading for concurrent chunk processing (see the sketch below).
- Use a distributed processing tool such as PySpark for very large datasets.
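The multithreading improvement could be sketched as follows; it assumes a process_chunk() helper along the lines of the import logic shown earlier, and the names are illustrative. Threads mainly help overlap the I/O-bound database inserts, since CPU-bound pandas work is still limited by the GIL; for truly CPU-bound or very large workloads, a process pool or PySpark is the better fit.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def process_chunk(chunk: pd.DataFrame) -> int:
    """Validate, deduplicate, and insert one chunk; returns the number of rows kept."""
    cleaned = chunk.dropna().drop_duplicates(subset="ip_address")
    # cleaned.to_sql(...) would perform the actual bulk insert here.
    return len(cleaned)

def import_concurrently(path: str, chunksize: int = 50_000, workers: int = 4) -> int:
    total = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        in_flight = []
        for chunk in pd.read_csv(path, chunksize=chunksize):
            in_flight.append(pool.submit(process_chunk, chunk))
            if len(in_flight) >= workers:
                # Wait for the oldest chunk to finish so memory use stays bounded.
                total += in_flight.pop(0).result()
        total += sum(future.result() for future in in_flight)
    return total
```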