Skip to content

This project aims to identify potential fraud among healthcare providers using medical claims data. Leveraging structured datasets, we integrate, clean, and transform raw information from multiple sources to prepare it for modeling.

License

Notifications You must be signed in to change notification settings

ankitghosal/healthcare_fraud_detection

Repository files navigation

🏥 Healthcare Provider Fraud Detection 🕵️‍♂️

A complete data analysis and feature engineering pipeline for detecting fraudulent healthcare providers using real-world insurance claim datasets.

🔍 Project Overview

This project aims to identify potential fraud among healthcare providers using medical claims data. Leveraging structured datasets, we integrate, clean, and transform raw information from multiple sources to prepare it for modeling.

Key Components:

  • Data cleaning & preprocessing
  • Handling missing values
  • Feature engineering
  • Exploratory Data Analysis (EDA)
  • Merging inpatient, outpatient, and beneficiary data
  • Preparing data for machine learning models

📁 Dataset Structure

The project uses the following datasets:

  • Train.csv & Test.csv: Core label data
  • Train_Beneficiarydata.csv / Test_Beneficiarydata.csv: Patient-level data
  • Train_Inpatientdata.csv / Train_Outpatientdata.csv: Medical claim details

These datasets were merged and cleaned to form a robust base for fraud detection.


  • Patient Age Distribution
image_2025-07-18_02-38-53

  • Best Performing Model Confusion Matrix & ROC AUC Score
image_2025-07-18_02-39-27

  • Model Accuracy Chart wrt Accuracy, Precision, Recall, F1 Score, ROC AUC Score
image_2025-07-18_02-40-52

🛠️ Technologies Used

  • Python (Pandas, NumPy, Seaborn, Matplotlib)
  • Jupyter Notebook
  • Power BI (for optional visualization)
  • SQL / DAX / Excel (supporting tools)

🚀 Features & Work Done

  • ✅ Merged and aligned multi-source datasets
  • ✅ Created new features for provider-level fraud analysis
  • ✅ Engineered time-based and code-based variables
  • ✅ Visualized fraud patterns and claim distribution
  • ✅ Prepared dataset for supervised learning

📊 Visualization & Insights

  • Age group and gender patterns in fraud
  • Frequency of diagnosis/procedure codes
  • Hospital stay trends and anomalies

All transformations were designed with modeling readiness and business interpretability in mind.

📌 How to Run

  1. Clone the repo
  2. Place the dataset files in the same directory
  3. Open Recent_Merge_Train_Test.ipynb in Jupyter
  4. Run each cell step-by-step

💡 Future Scope

  • Model training using classification algorithms
  • SHAP-based model interpretation
  • Deployment via Flask or Streamlit

📚 Author

Ankit Ghosal
MCA Candidate @ SVU (2026)
Passionate about data, business insights, ML and ethical AI.

🤝 Co-Contributor

Atanu Paul - let's connect
MCA Candidate @ SVU (2026)
Passionate about data, application development and MLOps.

⭐️ Star this repository if you found it helpful!
📬 Feel free to contribute or open issues!

About

This project aims to identify potential fraud among healthcare providers using medical claims data. Leveraging structured datasets, we integrate, clean, and transform raw information from multiple sources to prepare it for modeling.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages