This project implements a fully automated, end-to-end machine learning pipeline that ingests raw data, performs preprocessing and feature engineering, trains and evaluates multiple machine learning models, generates predictions, and monitors model performance over time.
The reference dataset is from Kaggle: https://www.kaggle.com/datasets/computingvictor/transactions-fraud-datasets
Clone the project
```bash
git clone https://github.com/Printf-Hello-World/CS611.git
```

Go to the project directory

```bash
cd CS611
```

Checkout a branch

```bash
git checkout -b <branch name>
```

Docker Compose

```bash
docker-compose up --build
```

| Category | Containers | Description |
|---|---|---|
| 🛠️ Airflow Containers | Apache Airflow 3.0.2 containers | Core Airflow components for orchestration and API management |
| 📊 MLflow Container | mlflow | Experiment tracking and model registry |
| 💻 Machine Learning | main (Jupyter Lab) | Model development and training environment |
| ⚡ Inference & Monitoring | FastAPI | Serving inference APIs and monitoring endpoints |
| 🗄️ Database Containers | redis, postgres | Caching and relational database storage |
The pipelines in this project can be run separately within their Docker containers, or they can be triggered from the Airflow UI.
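As a rough illustration of how a pipeline could be triggered from Airflow, the sketch below wraps the data pipeline script in a minimal DAG. The dag_id, schedule, project path, and import style (Airflow 3 task SDK plus the standard provider's BashOperator) are all assumptions for illustration, not the project's actual DAG:

```python
# Hypothetical sketch of an Airflow DAG that triggers the data pipeline.
# Assumes Airflow 3.x (DAG from the task SDK, BashOperator from the
# standard provider); dag_id, schedule, and the project path are
# illustrative, not taken from this repository.
from datetime import datetime

from airflow.sdk import DAG
from airflow.providers.standard.operators.bash import BashOperator

with DAG(
    dag_id="data_pipeline",            # assumed name
    start_date=datetime(2018, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="run_data_pipeline",
        # assumed mount point for the project inside the container
        bash_command="cd /opt/airflow/project && python run_data_pipeline.py",
    )
```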
The ETL pipelines are configured using the etl_conf.yaml file in the ETL folder.
Configure the ETL date range in etl_conf.yaml by specifying start_date and end_date:

```yaml
start_date: "2018-01-01"
end_date: "2019-01-31"
```

Run the following command in the project root:
```bash
python run_data_pipeline.py
```

This script will:
- Create bronze, silver, and gold features and labels and store them in an offline feature store.
- Create an online feature store by pulling the data from the offline feature store.
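For the second step, the general shape of an offline-to-online push might look like the sketch below: gold features are read from the offline store and written to Redis keyed by card_number. The file path, key schema, and connection details are assumptions, not the project's actual implementation:

```python
# Hypothetical sketch: load gold features from the offline store and
# write them to Redis as the online feature store. The path, key
# format, and Redis host are assumptions.
import pandas as pd
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

# Assumed offline store layout: parquet files under the gold feature dir.
features = pd.read_parquet("datamart/gold/feature_store")

for row in features.to_dict(orient="records"):
    key = f"features:{row['card_number']}"  # assumed key schema
    r.hset(key, mapping={k: str(v) for k, v in row.items()})
```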
The ML pipeline is configured using the ml_conf.yaml file in the ml folder.
This pipeline runs 3 different machine learning models:
- Logistic Regression (using sklearn)
- XGBoost (using xgboost)
- Multi-Layer Perceptron (MLP) (using sklearn)
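As a sketch of how the model_name config value might map onto these three estimators (a hypothetical helper for illustration, not the project's actual code):

```python
# Hypothetical helper: map the model_name config value to an estimator.
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier


def build_model(model_name: str, **params):
    if model_name == "logistic":
        return LogisticRegression(max_iter=1000, **params)
    if model_name == "mlp":
        return MLPClassifier(**params)
    if model_name == "xgboost":
        return XGBClassifier(eval_metric="logloss", **params)
    raise ValueError(f"Unknown model_name: {model_name}")
```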
There are two main parts to configure for this pipeline.
Machine learning parameters
```yaml
dataloader_config:
  # training range: from start_date up to the present day
  start_date: "2019-08-01"
  # gold feature dir
  gold_feature_dir: datamart/gold/feature_store
  # gold label dir
  gold_label_dir: datamart/gold/label_store

preprocessor_config:
  # SMOTE toggle
  use_smote: True
  # number of OOT splits taken from the most recent data
  oot_splits: 3
  # days per split (days, because the data is daily)
  oot_period: 7
  # features to keep (must include the label)
  columns_to_keep:
    - is_fraud
    - amount
    - use_chip
    - merchant_city
    - merchant_state
    - zip
    - mcc
    - errors
    - has_chip
    - num_credit_cards
    - card_brand
    - card_type
    - num_cards_issued
    - credit_limit
    - year_pin_last_changed
    - acct_opened_months
    - yrs_since_pin_changed
    - per_capita_income
    - credit_score
    - yearly_income

model_config:
  # MLflow experiment name
  experiment_name: "demo"
  run_name: "demo run"
  # model to train: mlp, xgboost, or logistic
  model_name: "xgboost"
  # number of Optuna trials
  n_trials: 20
  # model path, used when loading an existing model instead of training
  model_path: ""
```

Hyperparameter Search Space
```yaml
optuna_config:
  tunable_params:
    n_estimators:
      type: int
      low: 100
      high: 500
    max_depth:
      type: int
      low: 3
      high: 10
    learning_rate:
      type: float
      low: 0.01
      high: 0.3
    subsample:
      type: float
      low: 0.6
      high: 1.0
    colsample_bytree:
      type: float
      low: 0.6
      high: 1.0
```
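The tunable_params block maps directly onto Optuna's suggest API. A minimal sketch of how such a config could drive an objective function (the helper below is illustrative, not the project's code; the training step is elided):

```python
# Hypothetical sketch: turn the optuna_config ranges into trial suggestions.
import optuna

# Mirrors a subset of tunable_params from ml_conf.yaml.
TUNABLE_PARAMS = {
    "n_estimators": {"type": "int", "low": 100, "high": 500},
    "max_depth": {"type": "int", "low": 3, "high": 10},
    "learning_rate": {"type": "float", "low": 0.01, "high": 0.3},
}


def suggest_params(trial: optuna.Trial) -> dict:
    params = {}
    for name, spec in TUNABLE_PARAMS.items():
        if spec["type"] == "int":
            params[name] = trial.suggest_int(name, spec["low"], spec["high"])
        else:
            params[name] = trial.suggest_float(name, spec["low"], spec["high"])
    return params


def objective(trial: optuna.Trial) -> float:
    params = suggest_params(trial)
    # ... train the model with `params` and return the validation metric
    return 0.0  # placeholder


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)  # n_trials from ml_conf.yaml
```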
Run the following command in the project root after configuring the above:

```bash
python run_ml_pipeline.py
```

This script will:
- Load features and labels
- Perform time-based train-test-OOT splits (see the sketch after this list)
- Preprocess features for both logistic regression and XGBoost
- Apply SMOTE or undersampling depending on configuration
- Tune hyperparameters using Optuna
- Evaluate models and log all results to MLflow
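For reference, the time-based split could work roughly as follows, carving oot_splits windows of oot_period days off the most recent data. This is a sketch under assumed column names (a datetime `date` column), not the project's implementation:

```python
# Hypothetical sketch of a time-based train/OOT split driven by the
# preprocessor_config values; assumes the gold data has a datetime
# column named "date".
import pandas as pd


def time_based_split(df: pd.DataFrame, oot_splits: int = 3, oot_period: int = 7):
    df = df.sort_values("date")
    max_date = df["date"].max()
    oot_start = max_date - pd.Timedelta(days=oot_splits * oot_period)

    train = df[df["date"] <= oot_start]
    oot_windows = []
    for i in range(oot_splits):
        lo = oot_start + pd.Timedelta(days=i * oot_period)
        hi = lo + pd.Timedelta(days=oot_period)
        oot_windows.append(df[(df["date"] > lo) & (df["date"] <= hi)])
    return train, oot_windows
```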
To register your models, follow these steps in the MLflow UI (usually at localhost:5000):
- Navigate to the Models tab
- Click Register Model
- Select Create New Model
- Enter the model name: "demo"
- Go to the Model tab
- Click on the model you just registered (e.g., "demo")
- Add the alias "best" and save
Important: Before making any inference, run the following command in the project root. This script uses the card_number in the inference JSON to query the Redis database and retrieve the required input features for inference.

```bash
python run_online_feature_data_pipeline.py
```

Next, spin up the ml_inference_monitoring_container. This pulls the best model from the MLflow registry for inference. Then execute the following steps to get inference results:
- Go to the FastAPI docs (localhost:8000/docs)
- Click "GET", then "Try it out" and "Execute" to check whether the model is ready for prediction. If it is, "Model is ready for predictions" will be shown.
- Click "POST", then "Try it out". Enter the inference data in JSON format, then click "Execute". A sample request is shown below.
Inference results will be stored at monitoring/inference_data/
Run the following command in the project root. This script will monitor label and data drift.

```bash
python monitoring_pipeline.py
```

- Go to EvidentlyAI (localhost:9000)
- Click on Reports tab to access monitoring reports
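A drift report of this kind can also be produced with Evidently directly. The sketch below assumes the legacy (pre-0.7) evidently Report API and assumed parquet paths for the reference and current inference data:

```python
# Hypothetical sketch: build a data drift report with Evidently.
# Assumes the pre-0.7 evidently API; `reference` and `current` are
# pandas DataFrames of past vs. recent inference data (paths assumed).
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_parquet("monitoring/inference_data/reference.parquet")
current = pd.read_parquet("monitoring/inference_data/current.parquet")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data_drift_report.html")
```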
