MIT-Emerging-Talent · ggmeklit · Jun 1, 2025 · Jun 3, 2025 · Jun 8, 2025 · Jun 8, 2025
diff --git a/.markdownlint.yml b/.markdownlint.yml
@@ -1,3 +1,7 @@
 ignore:
   - venv
   - .github
+
+MD013:
+  line_length: 350
+
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -122,5 +122,10 @@
       "source.fixAll.ruff": "explicit",
       "source.organizeImports.ruff": "explicit"
     }
-  }
+  },
+  "cSpell.words": [
+    "Nazario",
+    "NLTK",
+    "stopwords"
+  ]
 }
diff --git a/0_domain_study/README.md b/0_domain_study/README.md
@@ -1 +1,87 @@
-# Domain Research
+# 🛡️ Domain Study: Phishing and Linguistic Influence on User Behavior
+
+Welcome to the `0_domain_study` folder! This section summarizes our team's research into phishing — specifically the linguistic features that affect user click-through behavior. Below you'll find a structured overview of our research domain, background, and actionable insights.
+
+---
+
+## 📌 Problem Statement (Based on Team's Personal Experiences)
+
+**Research Question:**  
+_What type of linguistic features in phishing emails influence user click-through behavior?_
+
+Phishing is a growing concern globally. Based on personal experiences:
+
+- **Meklit** (Canada) frequently encounters smishing and phishing at work. One incident involved a fake OneDrive link flagged by IT — highlighting how rushed environments reduce our vigilance.
+- **Mahdia** (Portugal) emphasized how scammers use sophisticated linguistic techniques to appear trustworthy and manipulate users.
+- **Ahmad** often receives fake IT department emails at work and prize scams in his personal life. He noticed a clear difference in tone: urgent and professional at work vs. emotional in personal life.
+- **Semira** first became interested in malware after seeing what her antivirus software detected. She strengthened her cybersecurity knowledge through research and workshops, learning to spot psychological manipulation such as fake tax threats or UPS warnings. She now verifies and reports these attempts.
+- **Musab** (USA) stressed the emotional toll of constant phishing attempts and how phishing poses both legal and financial risks.
+
+Together, we observed that phishing strategies are becoming more **emotionally manipulative**, **context-aware**, and **linguistically advanced**, requiring in-depth study of their language patterns.
+
+---
+
+## 🧠 Our Understanding of the Problem Domain (Using Systems Thinking)
+
+Phishing is a **socio-technical** problem involving three interconnected components:
+
+1. **Phishers (Attackers):**  
+   Skilled in social engineering. Use language to create urgency, trust, fear, or curiosity.
+
+2. **Communication Channels:**  
+   Mainly email, but also SMS (smishing) and voice (vishing). All channels aim to prompt the user into clicking or responding.
+
+3. **Recipients (Targets):**  
+   Everyday users or professionals. Often fall victim due to low awareness, poor digital hygiene, or stress.
+
+We particularly focus on the **linguistic layer**—how language is engineered to bypass cognitive defenses and influence behavior.
+
+---
+
+## ❓ Research Question
+
+> **"What type of linguistic features in phishing emails influence user click-through behavior?"**
+
+Given email’s dominant role in phishing, and the centrality of language in deceiving recipients, this research question aims to uncover patterns in wording, tone, and psychological triggers.
+
+---
+
+## 📚 Background Review of the Domain
+
+### 1. **Human Psychology and Language Triggers**
+
+- **Emotions** like fear, urgency, or reward are widely used in phishing (Jakobsson & Myers, 2006).
+- **Users acting under pressure** are less likely to evaluate messages critically (Vishwanath et al., 2011).
+
+### 2. **Phishing Detection Tools**
+
+- Tools like email filters, browser warnings, and ML-based classifiers can detect known phishing messages (Bergholz et al., 2010).
+- However, attackers adapt quickly with new linguistic patterns to bypass these systems.
+
+### 3. **User Education**
+
+- Training and awareness programs can be effective, but their success varies.
+- **Interactive and ongoing training** is more impactful than one-off sessions (Jansson & Von Solms, 2013).
+
+### 4. **Evolving Threat Landscape**
+
+- **Spear phishing** and **smishing** are on the rise (Hong, 2012).
+- Smartphones and social platforms open new vectors.
+- Despite evolution, **email remains the most common attack method** (CISA, 2023).
+
+Click here: [Full Background Review](https://docs.google.com/document/d/1at2nE_Ladr2_HlNFqoaHtACwAhOVvcFE6qYVRcrerbg/edit?tab=t.0)
+
+### Conclusion
+
+Phishing success stems largely from **manipulating language to trigger impulsive reactions**. Understanding this manipulation can help in detection and prevention.
+
+---
+
+## 📂 Resources & References
+
+- **Bergholz et al. (2010)** – Email filtering via ML  
+- **Hong (2012)** – Evolution of phishing  
+- **Jakobsson & Myers (2006)** – Psychological manipulation in phishing  
+- **Jansson & Von Solms (2013)** – Phishing education effectiveness  
+- **Vishwanath et al. (2011)** – User susceptibility factors
+- **CISA (2023)** – Counter-Phishing Recommendations for Federal Agencies
diff --git a/0_domain_study/retrospective.md b/0_domain_study/retrospective.md
@@ -0,0 +1,51 @@
+# Team Retrospective
+
+Over the past two weeks, our team focused on establishing group dynamics, aligning our goals, and initiating our research direction.
+
+## Week 1
+
+- We began by forming our group and establishing group structures and work plans.
+- Some team members took initiative to create shared documents for:
+  - Availability scheduling
+  - Individual learning goals
+  - Brainstorming research questions
+  - Outlining constraints and risks
+  - Choosing communication channels and project management tools
+
+- Based on the collected availability, we scheduled our first team meeting.
+
+### First Meeting Highlights
+
+- Introductions and initial team bonding
+- Overview and clarification of the project scope
+- Review of shared project documents
+- Brainstorming of potential research domains and questions
+- We expressed strong interest in Mahdia’s proposed questions around **phishing and the structure of phishing attempts**, and decided to pursue that direction further
+- The meeting was conducted via Slack Huddle and proved effective
+- We assigned everyone a couple of days for individual research to assess:
+  - Technical feasibility
+  - Data availability
+  - Timeline and scope
+  - Relevance and social impact of the project
+
+## Week 2
+
+- We scheduled our second meeting to discuss findings and finalize the research question.
+
+### Second Meeting Highlights
+
+- Shared research outcomes and assessed feasibility of the proposed phishing-related research
+- Refined and updated shared project documents
+- Divided remaining tasks to meet the milestone requirements
+- Revisited expectations: following Evan’s guidance, we agreed that **data acquisition** would not be our main concern at this stage; instead, our focus is to formulate a **socially beneficial and impactful research question**
+- Agreed to remain flexible and iterate on the research question in later stages if needed
+
+### Action Points
+
+- Gave each team member one more day to complete individual tasks
+- Final milestone deliverables will be pushed to GitHub by **end of day, 6/16**
+- Addressed communication issues and introduced new engagement guidelines to ensure everyone stays involved
+
+---
+
+We feel we’ve built a strong foundation as a team, aligned on a promising research area, and are moving forward with clear next steps.
diff --git a/2_data_preparation/README.md b/2_data_preparation/README.md
@@ -1 +1,112 @@
-# Data Preparation
+# How We Model Our Research and Possible Limitations
+
+Our research focuses on identifying linguistic differences between phishing emails and legitimate ones by studying how language is used in both.
+Phishing emails often use certain writing tactics to trick or pressure readers, and we aim to uncover those patterns. To do this, we analyze the content of emails
+using mathematical models that help us understand and visualize natural language. We chose raw, mostly real-world email datasets
+to keep our research broad and to let us apply different algorithms and approach the data from several angles.
+
+We work with large datasets made up of real emails. Each email includes a subject line, the main message (body), and a label marking it as either phishing or safe.
+
+To study these emails, we will use several techniques that convert text into a form we can measure. For example:
+
+- We will use tools that highlight which words are most frequent, important or unique in a message. This helps us identify phrases that show up more often in phishing emails than in legitimate ones.
+- We will apply topic modeling methods that reveal the main themes in groups of emails. This helps us understand what phishing messages tend to talk about even if they use different words.
+- We will also study how words relate to each other across different emails. By doing this, we can map the “meaning” and context of certain words, and how they are used differently in scam emails compared to safe ones.
+- We will analyze how each category of linguistic feature increases or decreases the likelihood that an email is phishing.
+
+These techniques help us turn language into patterns, allowing us to visualize and analyze how phishing emails are written and why they might be effective.
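
Note (illustrative sketch, not part of this PR's changes): the first bullet above, finding words that appear more often in phishing emails than in legitimate ones, could be approximated by comparing the share of emails in each class that contain a word. The toy corpus and the function names below are hypothetical; the project would read its cleaned CSV instead.

```python
import re
from collections import Counter

# Hypothetical toy corpus; the real project would load the cleaned CSV instead.
EMAILS = [
    ("verify your account now or it will be suspended", "phishing"),
    ("urgent verify your password to avoid suspension", "phishing"),
    ("meeting notes attached see you on monday", "safe"),
    ("please verify the attached meeting agenda", "safe"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_share(emails, label):
    """Return, per word, the share of emails with this label that contain it."""
    docs = [set(tokenize(text)) for text, lab in emails if lab == label]
    counts = Counter(word for doc in docs for word in doc)
    return {word: n / len(docs) for word, n in counts.items()}


def distinctive_words(emails, top_n=5):
    """Rank words by how much more common they are in phishing than in safe emails."""
    phishing = document_share(emails, "phishing")
    safe = document_share(emails, "safe")
    gaps = {w: share - safe.get(w, 0.0) for w, share in phishing.items()}
    # Sort by descending gap, breaking ties alphabetically for stable output.
    return sorted(gaps.items(), key=lambda item: (-item[1], item[0]))[:top_n]


if __name__ == "__main__":
    for word, gap in distinctive_words(EMAILS):
        print(f"{word:<12} {gap:+.2f}")
```

Comparing shares rather than raw counts keeps the ranking meaningful when the two classes differ in size. A weighting scheme such as TF-IDF, or a topic model, would be a natural next step but is not shown here.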
+
+---
+
+## About Our Data
+
+Our source dataset was obtained from Zenodo, a reputable open-access research data platform. It was published by researchers Arifa I. Champa, Fazle Rabbi, and Minhaz F. Zibran in August 2024. The dataset is also referenced in multiple peer-reviewed academic papers, making it a trustworthy source for phishing research.
+
+The Zenodo phishing validation dataset (2024) includes 11 datasets gathered from both historical corpora and curated sources. We used `Nazario.csv` and `Nazario_5.csv`.
+
+- `Nazario.csv` is a well-known phishing corpus created by Jose Nazario in the mid-2000s using spam traps and public reports, containing phishing emails.
+- `Nazario_5.csv` was curated by combining legitimate messages to create a balanced dataset for classification; it includes both phishing and non-phishing emails.
+
+Our dataset is composed of 1551 phishing and 1497 safe emails.
+
+## Limitations of Our Data
+
+Many of our emails date from the early and mid-2000s. They remain widely applicable, but phishing emails have evolved significantly since then.
+
+The Nazario_5 dataset does not explicitly state where its non-phishing and more recent emails originate; they were likely curated from other sources.
+
+Since our dataset is partially processed and curated from multiple sources, we cannot fully verify whether label assignment or content preservation was performed without error during cleaning, merging, or formatting.
+
+Some emails are missing their subject or body entirely, which reduces the effectiveness of content-based analysis. In some cases, an entry in the original dataset spans multiple lines within a single CSV cell and therefore had to be excluded from our dataset.
+
+Our final dataset after the cleaning script contains 1551 phishing and 1497 non-phishing emails. This slight imbalance between the two classes might affect our conclusions.
+We aim to mitigate it by:
+
+- For word/category analysis: use percentages within each class (phishing vs. safe), not raw counts.
+
+- For machine learning: apply techniques like class weighting, oversampling (adding more safe emails), or undersampling (reducing phishing emails).
+
+- For model evaluation: use metrics that account for imbalance, instead of just accuracy.
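
Note (illustrative sketch, not part of this PR's changes): two of the mitigations above can be written down in a few lines. The weighting formula `n_samples / (n_classes * class_count)` is the common "balanced" heuristic, and balanced accuracy is one possible imbalance-aware metric; both are assumptions about what the team might choose, and the predictions in the example are made up.

```python
from collections import Counter

# Class counts reported in this README.
CLASS_COUNTS = {"phishing": 1551, "safe": 1497}


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}


def balanced_accuracy(y_true, y_pred):
    """Average the per-class recall so each class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[label] / totals[label] for label in totals) / len(totals)


if __name__ == "__main__":
    for label, weight in balanced_class_weights(CLASS_COUNTS).items():
        print(f"{label}: {weight:.4f}")
    # Hypothetical predictions, only to exercise the metric.
    y_true = ["phishing", "phishing", "phishing", "safe"]
    y_pred = ["phishing", "phishing", "safe", "safe"]
    print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.4f}")
```

With 1551 versus 1497 emails the two weights land very close to 1, which suggests the imbalance here is mild.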
+
+## Limitations of Our Approach
+
+While our method gives us valuable insights, we’re aware of some limitations:
+
+- **Lack of Regional and Cultural Focus**: Our data is mostly global and not grouped by region or culture. Since writing styles vary across different places, this could limit how well our findings apply to specific populations.
+
+- **English-Only Data**: All of our analysis is based on emails written in English. This means our results may not apply to phishing messages written in other languages.
+
+- **Combining Different Datasets**: Our research brings together emails from different sources, which can sometimes lead to inconsistencies in how messages are labeled or formatted. This might affect the accuracy of our comparisons.
+
+- **Language Is Complex**: While mathematical models are powerful, they don’t always capture the full meaning behind language—like tone, sarcasm, or cultural references—which are often key parts of communication and deception.
+
+Despite these challenges, our goal is to use language-based patterns to better understand phishing techniques and support the development of tools that can detect these threats more effectively. In addition to the research above, we also aim to analyze features such as punctuation marks, message length, and grammatical error rates
+to investigate further possible patterns.
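
Note (illustrative sketch, not part of this PR's changes): surface features such as punctuation counts and message length can be computed without any model. The feature names below are hypothetical, and grammatical error rates are left out because they would need a dedicated grammar checker. Since `clean_text` in this PR lowercases text and strips punctuation, features like these would have to be computed on the uncleaned body.

```python
import re
import string


def surface_features(text):
    """Compute simple, model-free features from one raw email body."""
    words = re.findall(r"\S+", text)
    letters = [ch for ch in text if ch.isalpha()]
    return {
        "char_count": len(text),
        "word_count": len(words),
        "exclamation_count": text.count("!"),
        "question_count": text.count("?"),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        # Share of letters that are uppercase; 0.0 when there are no letters.
        "uppercase_ratio": (
            sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
        ),
    }


if __name__ == "__main__":
    for name, value in surface_features("URGENT!! Verify now?").items():
        print(f"{name}: {value}")
```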
+
+## Data Cleaning Scripts
+
+## Objective
+
+To create a fully reusable and reproducible Python script pipeline that:
+
+1. Cleans and preprocesses raw email data.  
+2. Prepares the text for linguistic analysis.  
+3. Organizes it into training and validation sets for future modeling tasks.  
+
+---
+
+## Step-by-Step Plan
+
+### 1. Input Requirements
+
+- **Expected input format**: CSV (or structured `.eml` in future versions).  
+- **Must include at least two columns**:  
+  - **Email Text**: the full message body.  
+  - **Email Type**: the label indicating if it's a phishing or legitimate email.  
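
Note (illustrative sketch, not part of this PR's changes): the input requirement above can be checked before any cleaning runs. The column names follow this README; `clean_data.py` in this PR reads `subject`, `body`, and other columns instead, so the names would need to match the actual CSV. The helper name `check_columns` is hypothetical.

```python
import csv
import io

# Column names as listed in this README.
REQUIRED_COLUMNS = ("Email Text", "Email Type")


def check_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are missing from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []  # fieldnames is None for an empty file
    return [name for name in required if name not in header]


if __name__ == "__main__":
    sample = io.StringIO("Email Text,Email Type\nhello,phishing\n")
    missing = check_columns(sample)
    print("missing columns:", missing)  # an empty list means the header is complete
```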
+
+### 2. Data Cleaning and Processing Pipeline
+
+This script will follow a detailed pipeline inspired by Champa et al. (2024), adaptable for real-world, messy email data:
+
+a. **Deduplication**  
+   Remove repeated emails by comparing the content of **Email Text**.
+
+b. **Discrepancy Removal**  
+   Drop rows with missing, empty, or null text entries.  
+   *Optionally* integrate language detection (e.g., `langdetect`) to retain only English emails.
+
+c. **HTML & Text Normalization**  
+   Strip HTML tags using **BeautifulSoup**. Retain only visible, meaningful plain text.
+
+d. **Case Normalization and Punctuation Cleanup**  
+   Convert all text to lowercase. Remove punctuation, symbols, and excessive whitespace using regex.
+
+e. **Stop Word Removal**  
+   Use **NLTK** or **spaCy** to filter out frequent stop words (e.g., “is”, “the”, “and”).
+
+f. **(Optional) Tokenization**  
+   If needed for later analysis or feature extraction, tokenize the cleaned text into word units.
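
Note (illustrative sketch, not part of this PR's changes): steps a to f above can be chained in a small function. To stay dependency-free, this sketch swaps in the standard library's `html.parser` for BeautifulSoup and a tiny hard-coded stop-word set for NLTK or spaCy; both substitutions, and all function names, are assumptions made for illustration. Language detection is omitted.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list; NLTK or spaCy would supply a fuller one.
STOP_WORDS = {"is", "the", "and", "a", "to", "of"}


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Step c: drop tags and keep the text between them."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(parser.parts)


def normalize(text):
    """Step d: lowercase, turn punctuation into spaces, collapse whitespace."""
    # Punctuation becomes a space here so hyphenated words do not fuse.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def preprocess(rows):
    """Run steps a-f over (text, label) pairs and return (tokens, label) pairs."""
    seen = set()
    cleaned = []
    for text, label in rows:
        if text is None or not text.strip():  # step b: missing or empty text
            continue
        if text in seen:  # step a: exact duplicate of an earlier email
            continue
        seen.add(text)
        tokens = normalize(strip_html(text)).split()  # steps c, d, f
        tokens = [t for t in tokens if t not in STOP_WORDS]  # step e
        cleaned.append((tokens, label))
    return cleaned


if __name__ == "__main__":
    rows = [("<p>Verify the ACCOUNT, now!</p>", "phishing"), ("   ", "safe")]
    print(preprocess(rows))
```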
+
+---
+
+*Once your merged dataset is finalized, simply run `clean_data.py` in this folder to produce `cleaned_phishing_dataset.csv` ready for modeling.*  
diff --git a/2_data_preparation/clean_data.py b/2_data_preparation/clean_data.py
@@ -0,0 +1,74 @@
+#!/usr/bin/env python3
+"""
+Author: Ahmad Hamed Dehzad
+Date:   2025-06-30
+
+Description:
+    Load, clean, and save the merged phishing dataset.
+    - Fills missing fields
+    - Parses dates
+    - Cleans HTML, punctuation, and whitespace from text
+    - Drops duplicates based on subject AND body
+    - Outputs cleaned_phishing_dataset.csv
+"""
+
+import os
+import re
+import warnings
+
+import pandas as pd
+from bs4 import BeautifulSoup, MarkupResemblesLocatorWarning
+
+# Suppress BeautifulSoup locator warnings
+warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)
+
+
+def clean_text(text: str) -> str:
+    """Remove HTML, punctuation, extra whitespace, and lowercase the text."""
+    # strip HTML
+    text = BeautifulSoup(str(text), "html.parser").get_text()
+    # remove punctuation first, so the gaps it leaves are collapsed below
+    text = re.sub(r"[^\w\s]", "", text)
+    # collapse runs of whitespace into a single space
+    text = re.sub(r"\s+", " ", text)
+    return text.lower().strip()
+
+
+def main():
+    # Determine the directory where this script lives
+    script_dir = os.path.dirname(os.path.abspath(__file__))
+
+    # Build full paths for input and output
+    input_path = os.path.join(script_dir, "merged_phishing_dataset.csv")
+    output_path = os.path.join(script_dir, "cleaned_phishing_dataset.csv")
+
+    # 1. Load the dataset
+    df = pd.read_csv(input_path)
+
+    # 2. Handle missing values
+    df["sender"] = df["sender"].fillna("missing")
+    df["receiver"] = df["receiver"].fillna("missing")
+    df["subject"] = df["subject"].fillna("missing")
+    df["body"] = df["body"].fillna("")
+
+    # 3. Parse dates (with UTC to avoid mixed-timezone issues)
+    df["date"] = pd.to_datetime(df["date"], errors="coerce", utc=True)
+    df["date"] = df["date"].dt.tz_localize(None).fillna(pd.Timestamp("1970-01-01"))
+
+    # 4. Clean text columns
+    df["subject"] = df["subject"].apply(clean_text)
+    df["body"] = df["body"].apply(clean_text)
+    df["urls"] = df["urls"].astype(str).apply(clean_text)
+
+    # 5. Drop duplicates where both cleaned subject and body match exactly
+    df = df.drop_duplicates(subset=["subject", "body"], keep="first").reset_index(
+        drop=True
+    )
+
+    # 6. Save cleaned dataset
+    df.to_csv(output_path, index=False)
+    print(f"Done! Saved cleaned dataset to:\n  {output_path}")
+
+
+if __name__ == "__main__":
+    main()
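
Note (illustrative sketch, not part of this PR's changes): `clean_text` deletes punctuation with one regex and collapses whitespace with another, so the order of the two passes changes the result. That in turn can affect the exact-match deduplication on subject and body. The two helper functions below are hypothetical and use only the regexes from the script, without the HTML step.

```python
import re


def collapse_then_strip(text):
    """Collapse whitespace first, then delete punctuation."""
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"[^\w\s]", "", text)
    return text.lower().strip()


def strip_then_collapse(text):
    """Delete punctuation first, then collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.lower().strip()


if __name__ == "__main__":
    sample = "Act now - verify your account !"
    # repr() makes the leftover double space visible.
    print(repr(collapse_then_strip(sample)))
    print(repr(strip_then_collapse(sample)))
```

Deleting punctuation last can leave a double space where a standalone symbol used to be, so two emails that differ only by such a symbol would not compare equal. Deleting it first avoids that.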
diff --git a/2_data_preparation/cleaned_phishing_dataset.csv b/2_data_preparation/cleaned_phishing_dataset.csv
diff --git a/3_data_exploration/README.md b/3_data_exploration/README.md
@@ -1 +1,12 @@
 # Data Exploration
+
+Our dataset contains 1551 phishing and 1497 safe emails. Below we visually summarize the class distribution of phishing and safe emails.
+
+![Class Distribution](class_distribution.png)
+
+Another important aspect we examine to better understand our data is word frequency.
+
+![Frequent Words](top_words.png)
+
+The presence of URLs is another aspect we aim to analyze in phishing vs. non-phishing emails.
+![URL_Existence](url_proportion.png)
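
Note (illustrative sketch, not part of this PR's changes): the numbers behind the class-distribution and URL charts can be computed as below. The class counts come from this README; the `(label, has_url)` rows and the function names are hypothetical stand-ins for the real columns.

```python
from collections import Counter

# Class counts reported above.
CLASS_COUNTS = {"phishing": 1551, "safe": 1497}


def class_percentages(counts):
    """Convert raw class counts into percentages of the whole dataset."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


def url_share_by_class(rows):
    """Return the share of emails per class whose URL flag is truthy."""
    totals = Counter()
    with_url = Counter()
    for label, has_url in rows:
        totals[label] += 1
        with_url[label] += bool(has_url)
    return {label: with_url[label] / totals[label] for label in totals}


if __name__ == "__main__":
    for label, pct in class_percentages(CLASS_COUNTS).items():
        print(f"{label}: {pct:.1f}%")
    # Hypothetical rows, only to exercise the function.
    rows = [("phishing", 1), ("phishing", 1), ("phishing", 0), ("safe", 0), ("safe", 1)]
    print(url_share_by_class(rows))
```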
diff --git a/3_data_exploration/class_distribution.png b/3_data_exploration/class_distribution.png
diff --git a/3_data_exploration/top_words.png b/3_data_exploration/top_words.png
diff --git a/3_data_exploration/url_proportion.png b/3_data_exploration/url_proportion.png
diff --git a/README.md b/README.md
@@ -0,0 +1,34 @@
+# 🐼 The Pandas Pact
+
+Welcome to the official repository of **The Pandas Pact** – a cross-cultural, collaborative team committed to learning, growing, and delivering impactful work together. This repository houses all project materials, documentation, meeting notes, and resources shared across our team.
+
+---
+
+## 🌍 About Us
+
+**The Pandas Pact** is a diverse team of global learners working together with respect, integrity, and creativity. Our mission is to foster a safe, inclusive, and productive environment where every member contributes meaningfully while embracing cultural differences.
+
+---
+
+## ✅ Group Norms
+
+Our norms guide how we work together. We believe in:
+
+- Respecting each other's time and communication styles
+- Open, constructive feedback
+- Clear expectations and accountability
+- Empathy, support, and trust in each other
+- Regular reflection and adaptation of our practices
+
+📄 You can read our [full group norms](/collaboration/README.md).
+
+---
+
+## 💬 Communication
+
+We coordinate and communicate using:
+
+- **Slack Channel**: `#et6_cdsp_group_14`
+- **Meetings**: Scheduled weekly or as needed via Zoom or Google Meet (details shared in Slack)
+- **Documentation**: Shared and updated here in the repo
+- **Meeting Notes**: Action items and summaries are posted in Slack and optionally stored in this repo