2 changes: 1 addition & 1 deletion 0_domain_study/README.md
@@ -44,7 +44,7 @@ We particularly focus on the **linguistic layer**—how language is engineered t

We updated the research question from the above to

>**A Linguistic Analysis of Phishing vs Legitimate Emails: Identifying Common Patterns and Language Tactics**
>**How do phishing emails differ from legitimate emails in terms of common linguistic patterns and language tactics?**

Given email’s dominant role in phishing and the centrality of language in deceiving recipients, this research question aims to uncover patterns in wording, tone, and psychological triggers. The revised question addresses the lack of data on user click-through behavior while still uncovering linguistic patterns in phishing and legitimate emails.
52 changes: 34 additions & 18 deletions 1_datasets/README.md
@@ -2,32 +2,48 @@

## Dataset Overview

This folder contains the datasets used for our phishing email detection research.
This folder contains all key datasets used across our project — from input data to the final feature-rich dataset used for analysis and modeling.

### Enron.csv
---

**Source**: Enron Email Dataset - publicly available corporate email dataset
### 📂 Enron.csv

**Collection Method**: Historical email data from Enron Corporation, preprocessed for phishing detection research
**Description**: This file contains the labeled, structured version of the Enron emails used as the **starting point** of our analysis.

**Size**: 29,767 emails
**Size**: 29,767 emails
**Columns**: `subject`, `body`, `label`

- Safe emails: 15,791 (53.05%)
- Phishing emails: 13,976 (46.95%)
**Use**: Input to the cleaning and preprocessing pipeline (`2_data_preparation/`)

**Connection to Research Question**: This dataset allows us to analyze linguistic patterns and psychological manipulation tactics used in phishing emails compared to legitimate business communications. The balanced nature of the dataset ensures fair comparison between phishing and safe email characteristics.
---

**Structure**:
### 📂 Enron_cleaned.csv

- Email content (subject + body)
- Binary classification label (phishing/safe)
- Preprocessed and cleaned text data
**Description**: Output of our cleaning pipeline. It adds normalized text in a `body_clean` column, with malformed rows removed.

**Limitations and Caveats**:
**Size**: 29,710 emails
**Columns**: `subject`, `body`, `label`, `body_clean`

- Historical data may not reflect current phishing techniques
- Corporate email context may not generalize to personal email patterns
- Dataset balance is artificially maintained and may not reflect real-world proportions
- Some preprocessing artifacts may be present from the original Enron dataset
**Use**: Input to the exploratory and statistical steps that precede full feature extraction

**Usage**: This dataset is used by scripts in `/2_data_preparation` for cleaning and feature extraction, and by analysis scripts in `/4_data_analysis` for modeling and statistical analysis.
---

### 📂 phishing_analysis_dataset.csv

**Description**: This is the **final dataset** used for all modeling and visualization. It includes all engineered linguistic features (e.g. sentiment, punctuation ratios, readability scores).

**Size**: 29,710 emails
**Columns**: 31 (including engineered features like `urgency_words`, `flesch_score`, `sentiment_neg`, etc.)

**Generated by**: `phishing_analysis.py`
**Used in**: All statistical comparisons, model training, confidence interval calculation, and plotting
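
A quick sanity check of this final dataset (a minimal sketch, assuming `pandas` is installed, the file is read from this folder, and the `label` column carries the phishing/safe class):

```python
import pandas as pd

# Load the final feature-rich dataset (path assumed relative to this folder)
df = pd.read_csv("phishing_analysis_dataset.csv")

print(df.shape)                    # expected: (29710, 31)
print(df["label"].value_counts())  # phishing vs. safe counts

# Inspect a few of the engineered linguistic features described above
for col in ["urgency_words", "flesch_score", "sentiment_neg", "exclamation_ratio"]:
    print(col, df[col].describe()[["mean", "std"]].to_dict())
```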

---

## Summary Table

| File Name | Description | Used In | Status |
|-----------------------------|----------------------------------------|----------------------------------|------------------|
| Enron.csv | Input dataset (structured, labeled) | `2_data_preparation/` | ✅ Initial Input |
| Enron_cleaned.csv | Cleaned version with `body_clean` | `3_data_exploration/` | ⚠️ Transitional |
| phishing_analysis_dataset.csv | Final dataset with all features | `4_data_analysis/`, all plots | ✅ Final version |
7 changes: 4 additions & 3 deletions 2_data_preparation/README.md
@@ -26,7 +26,8 @@ This folder contains scripts and notebooks for cleaning, transforming, and prepa

**Output**:

- Cleaned and feature-enriched dataset with 22 linguistic and psychological features
- Cleaned and feature-enriched dataset with **26** linguistic and psychological features
- Saved as `Enron_cleaned.csv` in `/1_datasets`
- Ready for statistical analysis and machine learning modeling

**Dependencies**: See `../4_data_analysis/requirements.txt` for required Python packages
@@ -56,6 +57,6 @@ We convert text into measurable patterns using several techniques:
To run the data cleaning pipeline:

1. Ensure the raw dataset is in `/1_datasets/Enron.csv`
2. Open `data_cleaning.ipynb` in Jupyter Notebook
2. Open `data_cleaning.py`
3. Run the script to process the data and extract features (a minimal sketch of the pipeline is shown below)
4. The cleaned dataset will be ready for analysis in `/4_data_analysis`
4. The cleaned dataset will be saved as `Enron_cleaned.csv` in `/1_datasets`
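
The contents of `data_cleaning.py` are not reproduced here; the following is only a minimal sketch of what the pipeline does, under assumptions about the exact cleaning rules and keyword lists (the real script extracts the full set of 26 features):

```python
import re
import pandas as pd

# Illustrative keyword list only -- the real pipeline uses a larger lexicon
URGENCY_TERMS = {"urgent", "immediately", "now", "asap", "expire"}

def clean_body(text: str) -> str:
    """Strip HTML tags, collapse whitespace, and lowercase the body text."""
    text = re.sub(r"<[^>]+>", " ", str(text))
    return re.sub(r"\s+", " ", text).strip().lower()

df = pd.read_csv("../1_datasets/Enron.csv")
df = df.dropna(subset=["body"])                  # drop malformed rows
df["body_clean"] = df["body"].apply(clean_body)

# Two of the simpler engineered features as examples
df["exclamation_ratio"] = df["body_clean"].str.count("!") / df["body_clean"].str.len().clip(lower=1)
df["urgency_words"] = df["body_clean"].apply(lambda t: sum(t.count(w) for w in URGENCY_TERMS))

df.to_csv("../1_datasets/Enron_cleaned.csv", index=False)
```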
14 changes: 7 additions & 7 deletions 4_data_analysis/enhanced_phishing_analysis_report.txt
@@ -39,7 +39,7 @@ TOP PHISHING-ASSOCIATED TERMS:
1. www (score: 0.0286)
2. click (score: 0.0245)
3. save (score: 0.0223)
4. email (score: 0.0222)
4. email (score: 0.0223)
5. money (score: 0.0205)
6. online (score: 0.0197)
7. free (score: 0.0177)
@@ -53,18 +53,18 @@ TOP PHISHING-ASSOCIATED TERMS:
15. site (score: 0.0148)

ENHANCED CLASSIFICATION MODEL PERFORMANCE:
- Overall Accuracy: 0.821
- Phishing Detection Precision: 0.850
- Phishing Detection Recall: 0.750
- Phishing Detection F1-Score: 0.797
- Overall Accuracy: 0.820
- Phishing Detection Precision: 0.849
- Phishing Detection Recall: 0.749
- Phishing Detection F1-Score: 0.796

TOP PREDICTIVE FEATURES:
- exclamation_ratio: 0.291
- exclamation_ratio: 0.292
- type_token_ratio: 0.126
- flesch_score: 0.095
- punctuation_density: 0.086
- financial_words: 0.079
- sentiment_compound: 0.077
- sentiment_compound: 0.076
- word_count: 0.068
- url_count: 0.046

48 changes: 24 additions & 24 deletions 4_data_analysis/feature_comparison_stats.csv
@@ -1,25 +1,25 @@
,phishing_mean,phishing_std,safe_mean,safe_std,difference,effect_size,95%_CI_difference,p_value
word_count,229.45874256219085,338.42479490387757,271.71562718101643,938.6766677151167,-42.25688461882558,-0.05989080700815021,,
sentence_count,18.065165961717685,44.98444889829016,18.817778059767782,65.23891209183068,-0.752612098050097,-0.013431242220297322,,
char_count,1261.6056348125314,1895.0149599090546,1461.7932872279678,5144.609049541321,-200.18765241543633,-0.05163826451026195,,
type_token_ratio,0.676272509612358,0.18277097878838935,0.6099932484182592,0.17355771952925933,0.06627926119409877,0.3718876282707118,,
avg_word_length,4.329588124907791,1.0875173384274401,4.029013315761807,0.61732939706783,0.3005748091459841,0.3399213360821731,,
avg_sentence_length,16.14292726877563,27.391463093997732,15.095832721402596,12.205022453433745,1.0470945473730335,0.04938095763731769,,
exclamation_ratio,0.00258205162144594,0.005607248372439563,0.0007311019203694226,0.0044770387008892885,0.0018509497010765173,0.3648117292741645,,
question_ratio,0.002605454707586778,0.024963834017119303,0.0009059561364641213,0.005359764579934193,0.0016994985711226566,0.09413228929317302,,
punctuation_density,0.025933549265703328,0.031056273019543106,0.025091925750876924,0.018787099165349225,0.0008416235148264034,0.03279187511662606,,
special_char_ratio,0.03412708661409992,0.03675890270827914,0.034685958897882584,0.02664277921940988,-0.000558872283782666,-0.01740936056866191,,
url_count,0.25227614882787297,0.8460183669184566,0.11097011610938393,0.8133137575412586,0.14130603271848904,0.17028368515744402,,
email_pattern_count,0.4296365330848089,1.4328029552811405,2.475287101072267,11.261532792336023,-2.0456505679874577,-0.25483673571598164,,
flesch_score,50.17407594394146,42.3467575350187,57.730381039372766,20.71069547921951,-7.556305095431306,-0.22669126660640995,,
flesch_grade,10.74697726982756,12.390217360390844,9.240431300839425,4.609773961673566,1.506545968988135,0.16116363192792216,,
fog_index,13.241345652256554,12.898044821827702,11.271360029023665,4.918198241673709,1.9699856232328887,0.20182529238853789,,
sentiment_neg,0.053926016201878264,0.0585460140820234,0.027808387792652753,0.044393868570967186,0.02611762840922551,0.5027060839089456,"(0.024924, 0.027311)",0.0
sentiment_neu,0.8144629005663488,0.09809430655018972,0.859988896643614,0.08796500770052453,-0.04552599607726515,-0.48864717998687024,,
sentiment_pos,0.13131952111262457,0.08476450486969467,0.11220233487722861,0.08204886178635463,0.01911718623539596,0.2291741228447618,,
sentiment_compound,0.5128887662198007,0.5833740313256381,0.5931555992640061,0.47307740190659414,-0.08026683304420534,-0.15113411063475496,,
urgency_words,0.31658183382321314,0.6252028418668598,0.1944673561322251,0.5838867207176203,0.12211447769098804,0.20187626923054244,,
action_words,0.789447272205893,1.1569730563733276,0.5742655922847535,1.0244263630758885,0.21518167992113946,0.19692454666401388,,
financial_words,1.1741343465481398,2.0271322010979036,0.6082101389505742,1.5126307806868995,0.5659242075975656,0.3164275866874213,,
fear_words,0.1564269840131909,0.46920329212773615,0.16946894232599455,0.5939814713140643,-0.013041958312803636,-0.024366518801252283,,
reward_words,0.9474514302100508,1.3868385062164348,0.5891123659666265,0.9133489612409881,0.3583390642434243,0.3051751996869503,,
word_count,229.45874256219085,338.4247949038765,271.71562718101643,938.6766677151041,-42.25688461882558,-0.05989080700815094,,
sentence_count,18.065165961717685,44.98444889829001,18.817778059767782,65.23891209183039,-0.752612098050097,-0.013431242220297379,,
char_count,1261.6056348125314,1895.0149599090582,1461.7932872279678,5144.609049541386,-200.18765241543633,-0.05163826451026136,,
type_token_ratio,0.6762725096123581,0.18277097878838924,0.6099932484182591,0.17355771952925886,0.06627926119409899,0.37188762827071364,,
avg_word_length,4.329588124907791,1.0875173384274401,4.029013315761807,0.6173293970678332,0.3005748091459841,0.3399213360821727,,
avg_sentence_length,16.142927268775626,27.391463093997704,15.095832721402596,12.205022453433712,1.04709454737303,0.049380957637317584,,
exclamation_ratio,0.00258205162144594,0.00560724837243995,0.0007311019203694224,0.004477038700889361,0.0018509497010765175,0.3648117292741469,,
question_ratio,0.002605454707586778,0.02496383401711473,0.0009059561364641213,0.005359764579933931,0.0016994985711226566,0.0941322892931897,,
punctuation_density,0.02593354926570333,0.031056273019543484,0.025091925750876924,0.01878709916534906,0.0008416235148264069,0.03279187511662598,,
special_char_ratio,0.03412708661409992,0.03675890270827934,0.034685958897882584,0.02664277921940962,-0.000558872283782666,-0.017409360568661908,,
url_count,0.25227614882787297,0.8460183669184124,0.11097011610938393,0.8133137575413496,0.14130603271848904,0.17028368515743947,,
email_pattern_count,0.4296365330848089,1.4328029552810346,2.475287101072267,11.261532792337706,-2.0456505679874577,-0.2548367357159445,,
flesch_score,50.17407594394146,42.34675753501857,57.730381039372766,20.71069547921926,-7.556305095431306,-0.22669126660641106,,
flesch_grade,10.74697726982756,12.390217360390865,9.240431300839425,4.609773961673573,1.506545968988135,0.16116363192792188,,
fog_index,13.241345652256554,12.898044821827723,11.271360029023663,4.918198241673677,1.9699856232328905,0.2018252923885379,,
sentiment_neg,0.05392601620187828,0.05854601408202173,0.027808387792652753,0.04439386857096694,0.026117628409225525,0.5027060839089559,"(0.024924, 0.027311)",0.0
sentiment_neu,0.8144629005663488,0.098094306550191,0.859988896643614,0.08796500770052394,-0.04552599607726515,-0.48864717998686824,,
sentiment_pos,0.13131952111262457,0.08476450486969499,0.1122023348772286,0.08204886178635548,0.019117186235395975,0.2291741228447604,,
sentiment_compound,0.5128887662198007,0.5833740313256403,0.5931555992640061,0.4730774019065854,-0.08026683304420534,-0.15113411063475576,,
urgency_words,0.31658183382321314,0.6252028418668634,0.1944673561322251,0.5838867207176504,0.12211447769098804,0.20187626923053695,,
action_words,0.789447272205893,1.1569730563733038,0.5742655922847535,1.0244263630757995,0.21518167992113946,0.19692454666402365,,
financial_words,1.1741343465481398,2.027132201097744,0.6082101389505742,1.5126307806868677,0.5659242075975656,0.31642758668743964,,
fear_words,0.1564269840131909,0.4692032921277927,0.16946894232599455,0.5939814713141618,-0.013041958312803636,-0.024366518801248696,,
reward_words,0.9474514302100508,1.3868385062163942,0.5891123659666265,0.9133489612411005,0.3583390642434243,0.3051751996869452,,
84 changes: 84 additions & 0 deletions 4_data_analysis/findings_summary.md
@@ -0,0 +1,84 @@
# Findings Summary

This study investigates the linguistic and structural signals that distinguish phishing emails from legitimate ones, and evaluates how effectively these can be used for automated detection.

## Research Question

**How do phishing emails differ from legitimate emails in terms of common linguistic patterns and language tactics?**

### Subquestions

- What specific linguistic features show measurable differences between phishing and legitimate emails?
- Are these differences statistically significant and meaningful?
- Can a machine learning model trained on these features accurately classify emails as phishing or legitimate?

---

## Hypothesis

Phishing emails exhibit distinct linguistic features—such as tone, structure, and word usage—that differ measurably from legitimate emails and can be used to accurately detect phishing attempts.

---

## What We Analyzed

- **Dataset**: 29,711 Enron emails (53.05% safe, 46.95% phishing)
- **Features**: 22 linguistic, readability, psychological, and structural variables.
- **Statistical Test**: Welch’s t-test for p-values; Cohen’s d for effect size using the report’s scale:
- 🟥 **Large**: |d| ≥ 0.50
- 🟧 **Medium**: 0.30 ≤ |d| < 0.50
- 🟨 **Small**: 0.20 ≤ |d| < 0.30
- **Model**: Random Forest Classifier (100 trees, `class_weight='balanced'`)
- Used for **phishing email detection** by learning patterns from linguistic and structural features.
- Combines predictions from multiple decision trees to improve accuracy and reduce overfitting.
- **Evaluation**: 5-fold cross-validation (a minimal sketch of this setup follows below).
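
As an illustration only (not the actual `phishing_analysis.py`), a minimal sketch of the Welch's t-test, Cohen's d, and cross-validated Random Forest steps, assuming the final dataset from `1_datasets/` and a binary `label` column (1 = phishing, 0 = safe):

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("phishing_analysis_dataset.csv")
phish = df[df["label"] == 1]   # label encoding assumed: 1 = phishing, 0 = safe
safe = df[df["label"] == 0]

feature = "sentiment_neg"      # any engineered feature works here

# Welch's t-test: does not assume equal variances between the two groups
t_stat, p_value = stats.ttest_ind(phish[feature], safe[feature], equal_var=False)

# Cohen's d using the simple average of the two group variances
pooled_sd = np.sqrt((phish[feature].var(ddof=1) + safe[feature].var(ddof=1)) / 2)
cohens_d = (phish[feature].mean() - safe[feature].mean()) / pooled_sd

# Random Forest (100 trees, balanced class weights) evaluated with 5-fold CV
X = df.drop(columns=["subject", "body", "body_clean", "label"], errors="ignore")
X = X.select_dtypes(include="number")
model = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
cv_acc = cross_val_score(model, X, df["label"], cv=5, scoring="accuracy")

print(f"{feature}: d = {cohens_d:.3f}, p = {p_value:.3g}")
print(f"5-fold CV accuracy: {cv_acc.mean():.3f}")
```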

---

## Cohen’s d Effect Sizes

🟥 **Large (≥ 0.50)**: sentiment_neg (0.503)

🟧 **Medium (0.30–<0.50)**: sentiment_neu (-0.489), type_token_ratio (0.372), exclamation_ratio (0.365), avg_word_length (0.340), financial_words (0.316), reward_words (0.305)

🟨 **Small (0.20–<0.30)**: url_count (0.270), action_words (0.197), urgency_words (0.202), flesch_score (-0.227), sentiment_pos (0.229), email_pattern_count (-0.255)

## Model Performance — Confusion Matrix

| | Predicted Phishing 🟥 | Predicted Safe 🟩 |
|-----------------|----------------------|------------------ |
| **Actual Phishing 🟥** | **TP = 10,752** | **FN = 3,197** |
| **Actual Safe 🟩** | **FP = 2,493** | **TN = 13,268** |

**Metrics:**

- **Accuracy**: 83.5% — overall correctness.
- **Precision (Phishing)**: 86.4% — of predicted phishing emails, % actually phishing.
- **Recall (Phishing)**: 77.1% — of actual phishing emails, % correctly found.
- **F1-score**: 81.5% — balance between precision and recall.

---

## Limitations

- **Temporal Bias**: Enron dataset (2001) may not reflect modern phishing tactics.
- **Domain Specificity**: Corporate email context limits generalizability.
- **Lexicon Bias**: VADER sentiment and static keyword lists may miss nuanced or emerging manipulation.
- **Model Interpretability**: While the Random Forest achieved high accuracy, it is not highly interpretable compared to simpler models.
- **No External Validation**: Model performance was evaluated only with 5-fold cross-validation on the Enron dataset; generalization to other datasets has not been tested.

---

## Conclusion

Using the report’s Cohen’s d scale, **negative sentiment** was the only feature in the large category, with several medium and small effects observed. While many individual features are not strong enough to detect phishing alone, **in combination they form a powerful detection tool** — as demonstrated by the Random Forest model’s 83.5% accuracy in
phishing email detection.

However, since the model was only evaluated with 5-fold cross-validation on the Enron dataset, the next step should include **external validation** on more recent and diverse datasets to confirm generalizability and ensure robustness against evolving phishing tactics.

---

## Next Steps

- Perform **external validation** using datasets from different domains and recent time periods to assess real-world applicability.
- Compare the Random Forest’s performance with more interpretable models to balance accuracy and explainability.