Data Pipeline Assignment

Running the application

First, clone the repository:

git clone https://github.com/csengevirag/assignment.git

Then have the path to your CSV file ready and insert it into docker-compose.yml at line 29: replace /absolut/path/to/csv/testfile.csv with the absolute path to your file.
(I did not include the example file because of possible privacy issues.)
Requirements: docker, docker-compose

Build and start the containers:

docker-compose build
docker-compose up

This also creates the PostgreSQL database and its tables automatically.

Design Decisions

Having four different tables reduces data redundancy (e.g., brands are saved in one place, and product types are stored based on their grouping). This approach improves data integrity and consistency: updates to data only need to happen in one place, reducing the risk of inconsistent data. It also prevents anomalies; for example, you can't accidentally delete a product and lose the only record of a brand. Lastly, it allows for more efficient use of storage by eliminating repeated data.
If a row has missing fields, the application prints an error message with the line number and the values that are present, and it does not insert anything into any of the database tables.
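The missing-field check can be sketched as follows. This is a minimal illustration, not the repository's actual code: the `REQUIRED_FIELDS` list, the `brand` column name, and the helpers `find_missing` and `split_rows` are hypothetical, and the sketch assumes rows are rejected one at a time.

```python
import csv
import io

# Hypothetical required columns; the real pipeline's list may differ.
REQUIRED_FIELDS = ("brand", "color", "size_type")


def find_missing(row, required=REQUIRED_FIELDS):
    """Return the names of required fields that are absent or blank."""
    return [name for name in required if not (row.get(name) or "").strip()]


def split_rows(reader):
    """Separate rows into (valid, rejected line numbers), printing an error per rejected row."""
    valid, rejected = [], []
    # start=2 because line 1 of the file is the header row.
    for line_number, row in enumerate(reader, start=2):
        missing = find_missing(row)
        if missing:
            present = {key: value for key, value in row.items() if value}
            print(f"Line {line_number}: missing {missing}, existing values: {present}")
            rejected.append(line_number)
        else:
            valid.append(row)
    return valid, rejected


# Small in-memory sample: the second data row has no brand.
sample = "brand,color,size_type\nAcme,red,regular\n,blue,regular\n"
valid_rows, rejected_lines = split_rows(csv.DictReader(io.StringIO(sample)))
```

Only the rows in `valid_rows` would be passed on to the insert step; the rejected line numbers are kept so they can be reported in a summary.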

Data Cleaning

Caching was introduced to make the pipeline faster: the application does not have to query the database again for values it has already seen.
The values of the color, age_group, gender and size_type columns are converted to title case.
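One way to combine the two ideas is an in-memory dictionary keyed by the title-cased value. The sketch below is an assumption about how this could look, not the repository's implementation: `LookupCache` and `fake_fetch_or_insert` are hypothetical names, and the fake function stands in for a real database query.

```python
class LookupCache:
    """In-memory cache mapping a cleaned value to its database id."""

    def __init__(self, fetch_or_insert):
        # fetch_or_insert is a hypothetical callable that talks to the database
        # and returns the id of the given value, inserting it if needed.
        self._fetch_or_insert = fetch_or_insert
        self._ids = {}

    def get_id(self, raw_value):
        # Normalize first so "red", "RED" and " Red " share one cache entry.
        value = raw_value.strip().title()
        if value not in self._ids:
            self._ids[value] = self._fetch_or_insert(value)
        return self._ids[value]


db_calls = []


def fake_fetch_or_insert(value):
    """Stand-in for a database round trip; records each call it receives."""
    db_calls.append(value)
    return len(db_calls)


colors = LookupCache(fake_fetch_or_insert)
ids = [colors.get_id(v) for v in ("red", "RED", " Red ", "dark blue")]
```

Here four lookups result in only two calls to the stand-in database function. Note that Python's `str.title()` capitalizes the letter after an apostrophe (for example `"men's"` becomes `"Men'S"`), so values with apostrophes may need separate handling.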

Brand Name Cleaning

Because there are so many variations of brand names, some with special characters, I focused on two main use cases:

  1. Different Casing: If the same brand appears with different casing, the first occurrence of the value is stored, and subsequent ones are compared to it after canonicalization.
  2. Sub-brands: If a brand has different "sub-brands" (that is, an addition after the original brand name, such as "Kidz" or "Tall"), these are handled by removing the additional strings and canonicalizing the brand name.

I concentrated on these cases because, although typos are common, I cannot be sure whether a brand name that differs from another by a single letter is a typo or a genuinely different brand. Instead of merging such names and increasing the risk of incorrect data or data loss, I chose to focus on removing sub-brand strings and canonicalization.
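The two cases above can be sketched like this. It is a simplified illustration under stated assumptions: the suffix list `SUB_BRAND_SUFFIXES` contains only the two examples mentioned, `strip_sub_brand` and `BrandRegistry` are hypothetical names, and canonicalization is reduced to case folding plus whitespace normalization.

```python
# Hypothetical suffix list; "Kidz" and "Tall" are the examples given above.
SUB_BRAND_SUFFIXES = ("kidz", "tall")


def strip_sub_brand(brand):
    """Drop known trailing sub-brand words and collapse extra whitespace."""
    words = brand.split()
    # Keep at least one word so a brand literally named "Tall" survives.
    while len(words) > 1 and words[-1].casefold() in SUB_BRAND_SUFFIXES:
        words.pop()
    return " ".join(words)


class BrandRegistry:
    """Remembers the first spelling seen for each canonical brand key."""

    def __init__(self):
        self._first_seen = {}

    def resolve(self, brand):
        display = strip_sub_brand(brand)
        # The case-folded name is the comparison key; the first spelling wins.
        return self._first_seen.setdefault(display.casefold(), display)


registry = BrandRegistry()
resolved = [registry.resolve(b) for b in ("Acme", "ACME", "acme Kidz", "Acme Tall", "Acmee")]
```

In this sketch "Acmee" stays a separate brand, which matches the decision not to guess at typos.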
