Data Pipeline Assignment

Running the application

First, check out the code:

git clone https://github.com/csengevirag/assignment.git

Then have the path to your CSV file ready and insert it into docker-compose.yml at line 29: replace /absolut/path/to/csv/testfile.csv with the absolute path to your file.
(I did not include the example file because of possible privacy issues.)
Requirements: docker, docker-compose

Build and start the application:

docker-compose build
docker-compose up

This also automatically creates the PostgreSQL database and its tables.

Design Decisions

Having four different tables reduces data redundancy (e.g., brands are saved in one place, and product types are stored based on their grouping). This approach improves data integrity and consistency: updates to data only need to happen in one place, reducing the risk of inconsistent data. It also prevents anomalies; for example, you can't accidentally delete a product and lose the only record of a brand. Lastly, it allows for more efficient use of storage by eliminating repeated data.
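To illustrate the point about redundancy and anomalies, here is a minimal sketch using Python's built-in sqlite3 module in place of PostgreSQL. The two tables, their column names, and the sample values are hypothetical and only stand in for a slice of the real four-table schema.

```python
import sqlite3

# Hypothetical two-table slice of the schema; real table and column names may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE brand (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE product (
        id       INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        brand_id INTEGER NOT NULL REFERENCES brand(id)
    );
""")

# The brand name is stored once; products only hold a reference to it.
conn.execute("INSERT INTO brand (id, name) VALUES (1, 'Acme')")
conn.execute("INSERT INTO product (name, brand_id) VALUES ('T-shirt', 1)")
conn.execute("INSERT INTO product (name, brand_id) VALUES ('Hoodie', 1)")

# Renaming the brand is a single-row update that every product sees.
conn.execute("UPDATE brand SET name = 'ACME' WHERE id = 1")
joined = conn.execute(
    "SELECT p.name, b.name FROM product p "
    "JOIN brand b ON b.id = p.brand_id ORDER BY p.id"
).fetchall()

# Deleting every product leaves the brand record intact.
conn.execute("DELETE FROM product")
brand_names = [row[0] for row in conn.execute("SELECT name FROM brand")]
product_count = conn.execute("SELECT COUNT(*) FROM product").fetchone()[0]
```

After the update, both joined rows carry the new brand name, and after the delete the brand table still holds its single row.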
If fields are missing, the application prints an error message with the line number and the values that are present, and it inserts nothing into any of the database tables.
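The missing-field behaviour could look roughly like the sketch below. It is not the application's actual code: the required column names and the function name are hypothetical, and it assumes the rejection applies per row, so that other rows are still processed.

```python
import csv
import io

# Hypothetical required columns; the real pipeline may require a different set.
REQUIRED_FIELDS = ("product_name", "brand", "color")


def split_valid_rows(csv_text):
    """Return (valid_rows, errors); a row with a missing required field is rejected whole."""
    valid_rows, errors = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        missing = [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            # Report only the values that are actually present in the row.
            present = {k: v for k, v in row.items() if isinstance(v, str) and v.strip()}
            errors.append(
                f"line {reader.line_num}: missing {missing}, existing values: {present}"
            )
            continue  # nothing from this row reaches any table
        valid_rows.append(row)
    return valid_rows, errors


sample = "product_name,brand,color\nShirt,Acme,red\nHat,,blue\n"
valid, errors = split_valid_rows(sample)
for message in errors:
    print(message)
```

Validating the whole row before any insert keeps the tables consistent: a row never ends up half-written, with a brand inserted but no matching product.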

Data Cleaning

Caching was introduced to make the pipeline faster: the application does not have to query the database again for values it has already seen.
The values of the color, age_group, gender and size_type columns are converted to title case.
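Both steps can be sketched in a few lines. This is an illustration under assumptions rather than the project's code: IdCache, clean_row and the fetch_or_insert callback are hypothetical names, and the callback stands in for whatever database lookup the pipeline really performs.

```python
import string

# Columns named in the text above.
TITLE_CASE_COLUMNS = ("color", "age_group", "gender", "size_type")


def clean_row(row):
    """Return a copy of the row with the listed columns converted to title case."""
    cleaned = dict(row)
    for column in TITLE_CASE_COLUMNS:
        if cleaned.get(column):
            # capwords capitalizes each whitespace-separated word and trims the ends.
            cleaned[column] = string.capwords(cleaned[column])
    return cleaned


class IdCache:
    """Remembers value -> id so each distinct value is looked up only once."""

    def __init__(self, fetch_or_insert):
        self._fetch_or_insert = fetch_or_insert  # hypothetical database call
        self._ids = {}

    def get_id(self, value):
        if value not in self._ids:
            self._ids[value] = self._fetch_or_insert(value)
        return self._ids[value]


# Usage with a fake lookup that records how often it is called.
calls = []


def fake_lookup(value):
    calls.append(value)
    return len(calls)


cache = IdCache(fake_lookup)
first = cache.get_id("Red")
second = cache.get_id("Red")  # served from the cache, no second lookup
```

Applying the title-case step before the cache lookup means that "red" and "RED" share one cache entry and one database row.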

Brand Name Cleaning

Because there are so many variations of brand names, some with special characters, I focused on two main use cases:

Different Casing: If the same brand appears with different casing, the first occurrence of the value is stored, and subsequent ones are compared to it after canonicalization.
Sub-brands: If a brand has different "sub-brands" (i.e., an addition after the original brand name, such as "Kidz," "Tall," etc.), these are handled by removing the additional strings and canonicalizing the brand name.

I concentrated on these cases because, although typos are common, I cannot be sure that a brand name differing from another by a single letter is not a genuinely different brand. Instead of filtering these out and increasing the risk of incorrect data or data loss, I chose to focus on removing sub-brand strings and canonicalization.
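The two cases can be sketched as follows. This is a minimal illustration, not the application's implementation: it assumes that canonicalization means case-folding, that a sub-brand is a single trailing word, and that the suffix list contains only the two examples mentioned above. All names in the code are hypothetical.

```python
# Hypothetical suffix list; "Kidz" and "Tall" are the examples from the text.
SUB_BRAND_SUFFIXES = ("kidz", "tall")


def strip_sub_brand(brand):
    """Drop one trailing sub-brand word, keeping the original casing of the rest."""
    words = brand.split()
    if len(words) > 1 and words[-1].casefold() in SUB_BRAND_SUFFIXES:
        words = words[:-1]
    return " ".join(words)


class BrandNormalizer:
    """Maps every spelling of a brand to the first spelling that was seen."""

    def __init__(self):
        self._first_seen = {}  # case-folded key -> first spelling encountered

    def normalize(self, brand):
        base = strip_sub_brand(brand)
        # setdefault stores the first occurrence and returns it for later variants.
        return self._first_seen.setdefault(base.casefold(), base)


normalizer = BrandNormalizer()
results = [normalizer.normalize(b) for b in ("Acme", "ACME", "acme Kidz", "Acne")]
```

In this sketch "ACME" and "acme Kidz" both resolve to the first spelling, "Acme", while "Acne" stays a separate brand because names that differ by one letter are deliberately not merged.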
