
Geolocation Service

This application imports large geolocation datasets from a CSV file into a PostgreSQL database and provides an API to query the geolocation data by IP address.

Features

  1. Chunk-Based CSV Import: Handles large datasets efficiently by processing data in chunks (see the sketch after this list).
  2. Validation: Ensures only valid rows are imported, skipping invalid entries.
  3. Deduplication: Removes duplicate rows based on the IP address.
  4. PostgreSQL: Uses a production-grade database.
  5. API: Provides an endpoint to query geolocation data by IP address.
  6. Containerization: Uses Docker and Docker Compose for easy setup and deployment.
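
The chunked import and deduplication can be pictured with the sketch below (table and column names are assumptions for illustration; the actual logic lives in etl/pipeline.py and etl/validators.py):

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical connection string; in the app it is built from .env via app/config.py
    engine = create_engine("postgresql://user:password@localhost:5432/geo_db")

    CHUNK_SIZE = 50_000  # rows per chunk (assumed value; tune to available memory)

    def import_csv(path: str) -> None:
        seen_ips: set = set()  # IPs from earlier chunks, for cross-chunk deduplication
        for chunk in pd.read_csv(path, chunksize=CHUNK_SIZE):
            chunk = chunk.dropna(subset=["ip_address"])           # validation (simplified)
            chunk = chunk.drop_duplicates(subset=["ip_address"])  # dedup within the chunk
            chunk = chunk[~chunk["ip_address"].isin(seen_ips)]    # dedup across chunks
            seen_ips.update(chunk["ip_address"])
            chunk.to_sql("geolocations", engine, if_exists="append", index=False)

Keeping only the set of already-seen IP addresses in memory, rather than whole chunks, is what lets the import handle files larger than RAM.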

Application Structure

geo_service/
├── app/
│   ├── __init__.py            # Empty file to make app a module
│   ├── config.py              # Application configuration using .env
│   ├── constants.py           # Constants for CSV column names
│   ├── database.py            # Database setup and connection logic
│   ├── models.py              # Database schema for geolocation records
│   ├── main.py                # FastAPI app initialization
│   └── routes.py              # API routes definition
├── etl/
│   ├── data_dump.csv          # Given CSV file with geolocation data
│   ├── validators.py          # Validation logic for data sanitization
│   └── pipeline.py            # Initiates the ETL process
├── tests/
│   ├── __init__.py
│   ├── test_data/             # Sample data for testing
│   ├── test_etl.py            # Unit tests for ETL services (e.g., data loading)
│   ├── test_database.py       # Unit tests for the database layer
│   └── test_api.py            # Unit tests for the API
├── .env                       # Environment variables for configuration
├── Dockerfile                 # Dockerfile to containerize the application
├── docker-compose.yml         # Compose file to run app and PostgreSQL together
├── requirements.txt           # Python dependencies with versions
└── README.md                  # Documentation

Setup Instructions

Prerequisites

  • Docker and Docker Compose installed
  • Database credentials configured in .env (ref: example.env); a sketch of how app/config.py might read them follows this list
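
As a rough illustration, app/config.py could read these variables along the following lines (the key names here are assumptions; example.env lists the real ones):

    import os

    from dotenv import load_dotenv  # python-dotenv

    load_dotenv()  # load variables from .env into the process environment

    # Hypothetical key names; see example.env for the actual ones
    DB_USER = os.getenv("POSTGRES_USER", "postgres")
    DB_PASSWORD = os.getenv("POSTGRES_PASSWORD", "")
    DB_HOST = os.getenv("POSTGRES_HOST", "localhost")
    DB_NAME = os.getenv("POSTGRES_DB", "geo_db")

    DATABASE_URL = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:5432/{DB_NAME}"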

Installation

  1. Build and start the application:

    docker compose up -d
  2. Access the application:

    • Swagger API Docs: http://localhost:8000/docs

Importing Data

To import data from a CSV file into the database:

  1. Trigger the ETL process from the container shell:
    docker exec geo-web-app python app/etl/pipeline.py

Querying Data

  1. Query the geolocation data by IP address:

    GET /geolocation/{ip_address}
    
  2. Example Response:

    {
       "ip_address": "85.175.199.150",
       "country_code": "TW",
       "country": "Saudi Arabia",
       "city": "Deckowtown",
       "latitude": 25.6561924125008,
       "longitude": -163.7348649682793
    }
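
For orientation, the route in app/routes.py plausibly resembles the sketch below (Geolocation and get_db are assumed names for the ORM model and session dependency):

    from fastapi import APIRouter, Depends, HTTPException
    from sqlalchemy.orm import Session

    from app.database import get_db      # assumed session dependency
    from app.models import Geolocation   # assumed ORM model

    router = APIRouter()

    @router.get("/geolocation/{ip_address}")
    def get_geolocation(ip_address: str, db: Session = Depends(get_db)):
        # Look up the record for the requested IP; 404 if the IP is unknown
        record = (
            db.query(Geolocation)
            .filter(Geolocation.ip_address == ip_address)
            .first()
        )
        if record is None:
            raise HTTPException(status_code=404, detail="IP address not found")
        return {
            "ip_address": record.ip_address,
            "country_code": record.country_code,
            "country": record.country,
            "city": record.city,
            "latitude": record.latitude,
            "longitude": record.longitude,
        }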

Database UI

  1. pgAdmin: http://localhost:5050

Testing

To run unit tests:

  1. Run the test suite with a code coverage report:
    docker exec -it geo-web-app pytest --disable-warnings --cov=app tests/
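
A test in tests/test_api.py could follow the pattern below (a sketch, assuming the FastAPI instance is named app in app/main.py):

    from fastapi.testclient import TestClient

    from app.main import app  # assumed location of the FastAPI instance

    client = TestClient(app)

    def test_unknown_ip_returns_404():
        # 192.0.2.1 is in a reserved documentation range (RFC 5737),
        # so it should never appear in the imported dataset
        response = client.get("/geolocation/192.0.2.1")
        assert response.status_code == 404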

Design Choices

  1. PostgreSQL: Chosen for scalability and production-readiness.
  2. Pandas: Efficient for large-scale data processing and validation.
  3. FastAPI: A modern, lightweight framework for building high-performance APIs with automatically generated documentation.

Trade-offs

  1. Limited scalability:
    • As the dataset grows, this approach will struggle: the single-threaded, chunk-by-chunk process does not fully utilize CPU resources or database concurrency.
  2. Bottleneck for large datasets:
    • When the dataset exceeds what Pandas can hold in memory, we can leverage distributed processing instead.

Scope Of Improvements

  • Temporarily drop indexes during the bulk import, then recreate them after the load.
  • Add caching (e.g., an LRU cache or Redis) for frequent queries; see the sketch after this list.
  • Use multithreading for concurrent chunk processing.
  • Use a distributed processing tool such as PySpark for very large datasets.
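
The caching idea, for instance, could start as simply as wrapping the lookup in functools.lru_cache (a sketch; fetch_geolocation and the table name are hypothetical):

    from functools import lru_cache

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://user:password@localhost:5432/geo_db")  # assumption

    @lru_cache(maxsize=10_000)  # keep the 10k most recently queried IPs in memory
    def fetch_geolocation(ip_address: str):
        with engine.connect() as conn:
            row = conn.execute(
                text("SELECT * FROM geolocations WHERE ip_address = :ip"),
                {"ip": ip_address},
            ).mappings().first()
        return dict(row) if row else None

Since the dataset is loaded once and queried many times, cache staleness is a minor concern, which makes an in-process LRU cache a cheap first step before reaching for Redis.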
