Open
46 commits
2f6582f
Group norms + Repo readme file update!
MahdiaAhmadi Jun 1, 2025
33657b5
Merge branch 'MIT-Emerging-Talent:main' into main
MUSABKAYMAK Jun 3, 2025
e08b60b
Merge branch 'MIT-Emerging-Talent:main' into main
MUSABKAYMAK Jun 8, 2025
78965f3
Update .markdownlint.yml
MahdiaAhmadi Jun 8, 2025
57326ea
Update .markdownlint.yml
MahdiaAhmadi Jun 10, 2025
f1bb4da
trial
MUSABKAYMAK Jun 10, 2025
3eb4f8e
trial
MUSABKAYMAK Jun 10, 2025
a1ad6a9
change 1 to constraints
ggmeklit Jun 15, 2025
36fb2d3
commit #1 - changes to constraint doc
ggmeklit Jun 15, 2025
31cbf42
fixing linting
ggmeklit Jun 15, 2025
f5cf5a9
fixed header on retrospective file
ggmeklit Jun 15, 2025
850852f
Added meeting notes
ggmeklit Jun 15, 2025
b819f5a
Merge pull request #11 from MIT-Emerging-Talent/Cross_Cultural_Collab…
MahdiaAhmadi Jun 15, 2025
e0f810c
Merge pull request #10 from MIT-Emerging-Talent/1.Problem_identification
MahdiaAhmadi Jun 15, 2025
dd73647
Add milestone 1 deliverables
MahdiaAhmadi Jun 15, 2025
36709ac
Add milestone 1 deliverables
MahdiaAhmadi Jun 15, 2025
f8dafce
Merge branch 'main' into Problem_identification
MahdiaAhmadi Jun 15, 2025
af37fa2
Update retrospective.md
MahdiaAhmadi Jun 15, 2025
a627b38
Merge pull request #13 from MIT-Emerging-Talent/Problem_identification
MUSABKAYMAK Jun 15, 2025
24afc32
update to communication file
ggmeklit Jun 16, 2025
2c205be
Learning goals for the group
SEMIRATESFAI Jun 16, 2025
04aa186
fixed multiple items on learning goals
SEMIRATESFAI Jun 16, 2025
e5f1b94
Submitting Contributing plan for effective virtual collaboration
SEMIRATESFAI Jun 16, 2025
8c51031
fixing error on communication file
ggmeklit Jun 17, 2025
79c7da6
Merge branch 'main' into Cross_Cultural_Collaboration
ggmeklit Jun 17, 2025
98c71e2
update to constraints file
ggmeklit Jun 17, 2025
adac151
Merge branch 'Cross_Cultural_Collaboration' of https://github.com/MIT…
ggmeklit Jun 17, 2025
8aed6d2
final edit on communication
ggmeklit Jun 17, 2025
403e5b4
Merge pull request #16 from MIT-Emerging-Talent/Cross_Cultural_Collab…
SEMIRATESFAI Jun 17, 2025
a585d87
Merge pull request #17 from MIT-Emerging-Talent/learning-goals
ggmeklit Jun 17, 2025
e7e0286
removed older team availablity
ggmeklit Jun 17, 2025
831d196
Merge pull request #18 from MIT-Emerging-Talent/Cross_Cultural_Collab…
SEMIRATESFAI Jun 17, 2025
383cc21
edit and added resource link
SEMIRATESFAI Jun 17, 2025
81082fb
edited and added link
SEMIRATESFAI Jun 17, 2025
6ee2d66
Merge pull request #19 from MIT-Emerging-Talent/learning-goals
ggmeklit Jun 17, 2025
76d5277
refact: Update README and collaboration documents for clarity and str…
MahdiaAhmadi Jun 29, 2025
91fb434
Add data cleaning script and cleaned dataset
AhmadHamedDehzad Jun 30, 2025
a91838a
Update clean_data.py with header comments; add refreshed cleaned_phis…
AhmadHamedDehzad Jun 30, 2025
77059cd
Add data cleaning README with objectives and step-by-step plan
AhmadHamedDehzad Jun 30, 2025
a820f72
read_me updated
ggmeklit Jul 1, 2025
f244e8f
updated_readme
ggmeklit Jul 1, 2025
4763381
Few changes
ggmeklit Jul 1, 2025
4fac1ad
reorganized
ggmeklit Jul 1, 2025
8608972
reorganized
ggmeklit Jul 1, 2025
7835a53
created retrospective file
SEMIRATESFAI Jul 1, 2025
bafdfb4
Update README.md
ggmeklit Jul 1, 2025
4 changes: 4 additions & 0 deletions .markdownlint.yml
Original file line number Diff line number Diff line change
@@ -1,3 +1,7 @@
ignore:
- venv
- .github

MD013:
  line_length: 350

7 changes: 6 additions & 1 deletion .vscode/settings.json
Original file line number Diff line number Diff line change
Expand Up @@ -122,5 +122,10 @@
"source.fixAll.ruff": "explicit",
"source.organizeImports.ruff": "explicit"
}
}
},
"cSpell.words": [
"Nazario",
"NLTK",
"stopwords"
]
}
88 changes: 87 additions & 1 deletion 0_domain_study/README.md
Original file line number Diff line number Diff line change
@@ -1 +1,87 @@
# Domain Research
# 🛡️ Domain Study: Phishing and Linguistic Influence on User Behavior

Welcome to the `0_domain_study` folder! This section summarizes our team's research into phishing — specifically the linguistic features that affect user click-through behavior. Below you'll find a structured overview of our research domain, background, and actionable insights.

---

## 📌 Problem Statement (Based on Team's Personal Experiences)

**Research Question:**
_What type of linguistic features in phishing emails influence user click-through behavior?_

Phishing is a growing concern globally. Based on personal experiences:

- **Meklit** (Canada) frequently encounters smishing and phishing at work. One incident involved a fake OneDrive link flagged by IT — highlighting how rushed environments reduce our vigilance.
- **Mahdia** (Portugal) emphasized how scammers use sophisticated linguistic techniques to appear trustworthy and manipulate users.
- **Ahmad** often receives fake IT department emails at work and personal prize scams. He noticed a clear difference in language tone: urgent and professional at work vs emotional in personal life.
- **Semira** first became curious about fake malware alerts after watching what her anti-virus software flagged. She strengthened her cybersecurity knowledge through research and workshops, learning to spot psychological manipulation such as fake tax threats or UPS warnings. She now verifies and reports these attempts.
- **Musab** (USA) stressed the emotional toll of constant phishing attempts and how phishing poses both legal and financial risks.

Together, we observed that phishing strategies are becoming more **emotionally manipulative**, **context-aware**, and **linguistically advanced**, requiring in-depth study of their language patterns.

---

## 🧠 Our Understanding of the Problem Domain (Using Systems Thinking)

Phishing is a **socio-technical** problem involving three interconnected components:

1. **Phishers (Attackers):**
Skilled in social engineering. Use language to create urgency, trust, fear, or curiosity.

2. **Communication Channels:**
Mainly email, but also SMS (smishing) and voice (vishing). All channels aim to prompt the user into clicking or responding.

3. **Recipients (Targets):**
Everyday users or professionals. Often fall victim due to low awareness, poor digital hygiene, or stress.

We particularly focus on the **linguistic layer**—how language is engineered to bypass cognitive defenses and influence behavior.

---

## ❓ Research Question

> **"What type of linguistic features in phishing emails influence user click-through behavior?"**

Given email’s dominant role in phishing, and the centrality of language in deceiving recipients, this research question aims to uncover patterns in wording, tone, and psychological triggers.

---

## 📚 Background Review of the Domain

### 1. **Human Psychology and Language Triggers**

- **Emotions** like fear, urgency, or reward are widely used in phishing (Jakobsson & Myers, 2006).
- **Users acting under pressure** are less likely to evaluate messages critically (Vishwanath et al., 2011).

### 2. **Phishing Detection Tools**

- Tools like email filters, browser warnings, and ML-based classifiers can detect known phishing messages (Bergholz et al., 2010).
- However, attackers adapt quickly with new linguistic patterns to bypass these systems.

### 3. **User Education**

- Training and awareness programs are effective but vary in success.
- **Interactive and ongoing training** is more impactful than one-off sessions (Jansson & Von Solms, 2013).

### 4. **Evolving Threat Landscape**

- **Spear phishing** and **smishing** are on the rise (Hong, 2012).
- Smartphones and social platforms open new vectors.
- Despite evolution, **email remains the most common attack method** (CISA, 2023).

Click here: [Full Background Review](https://docs.google.com/document/d/1at2nE_Ladr2_HlNFqoaHtACwAhOVvcFE6qYVRcrerbg/edit?tab=t.0)

### Conclusion

Phishing success stems largely from **manipulating language to trigger impulsive reactions**. Understanding this manipulation can help in detection and prevention.

---

## 📂 Resources & References

- **Bergholz et al. (2010)** – Email filtering via ML
- **Hong (2012)** – Evolution of phishing
- **Jakobsson & Myers (2006)** – Psychological manipulation in phishing
- **Jansson & Von Solms (2013)** – Phishing education effectiveness
- **Vishwanath et al. (2011)** – User susceptibility factors
- **CISA (2023)** – Counter-Phishing Recommendations for Federal Agencies
51 changes: 51 additions & 0 deletions 0_domain_study/retrospective.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# Team Retrospective

Over the past two weeks, our team focused on establishing group dynamics, aligning our goals, and initiating our research direction.

## Week 1

- We began by forming our group and establishing group structures and work plans.
- Some team members took initiative to create shared documents for:
  - Availability scheduling
  - Individual learning goals
  - Brainstorming research questions
  - Outlining constraints and risks
  - Choosing communication channels and project management tools

- Based on the collected availability, we scheduled our first team meeting.

### First Meeting Highlights

- Introductions and initial team bonding
- Overview and clarification of the project scope
- Review of shared project documents
- Brainstorming of potential research domains and questions
- We expressed strong interest in Mahdia’s proposed questions around **phishing and the structure of phishing attempts**, and decided to pursue that direction further
- The meeting was conducted via Slack Huddle and proved effective
- We assigned everyone a couple of days for individual research to assess:
  - Technical feasibility
  - Data availability
  - Timeline and scope
  - Relevance and social impact of the project

## Week 2

- We scheduled our second meeting to discuss findings and finalize the research question.

### Second Meeting Highlights

- Shared research outcomes and assessed feasibility of the proposed phishing-related research
- Refined and updated shared project documents
- Divided remaining tasks to meet the milestone requirements
- Revisited expectations: following Evan’s guidance, we agreed that **data acquisition** would not be our main concern at this stage; instead, our focus is to formulate a **socially beneficial and impactful research question**
- Agreed to remain flexible and iterate on the research question in later stages if needed

### Action Points

- Gave each team member one more day to complete individual tasks
- Final milestone deliverables will be pushed to GitHub by **end of day, 6/16**
- Addressed communication issues and introduced new engagement guidelines to ensure everyone stays involved

---

We feel we’ve built a strong foundation as a team, aligned on a promising research area, and are moving forward with clear next steps.
113 changes: 112 additions & 1 deletion 2_data_preparation/README.md
Original file line number Diff line number Diff line change
@@ -1 +1,112 @@
# Data Preparation
# How We Model Our Research and Possible Limitations

Our research focuses on identifying linguistic differences between phishing emails and legitimate ones by studying how language is used in both.
Phishing emails often use certain writing tactics to trick or pressure readers, and we aim to uncover those patterns. To do this, we analyze the content of emails
using mathematical models that help us understand and visualize natural language. We chose raw, mostly real-world email datasets
to keep our research broad and to let us apply different algorithms and approach the data from different angles.

We work with large datasets made up of real emails. Each email includes a subject line, the main message (body), and a label marking it as either phishing or safe.

To study these emails, we will use several techniques that convert text into a form we can measure. For example:

- We will use tools that highlight which words are most frequent, important or unique in a message. This helps us identify phrases that show up more often in phishing emails than in legitimate ones.
- We are going to apply topic modeling methods that reveal the main themes in groups of emails. This helps us understand what phishing messages tend to talk about even if they use different words.
- We will also study how words relate to each other across different emails. By doing this, we can map the “meaning” and context of certain words, and how they are used differently in scam emails compared to safe ones.
- We will be analyzing how each category of linguistic feature increases or decreases the likelihood of an email being phishing or non-phishing.

These techniques help us turn language into patterns, allowing us to visualize and analyze how phishing emails are written and why they might be effective.
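
As a minimal sketch of the first technique (scoring which words are frequent in one message but rare across the corpus), the snippet below computes plain TF-IDF with the standard library only. The function name `tf_idf` and the toy emails are illustrative assumptions, not part of our pipeline; a real analysis would more likely rely on an established library.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: tf-idf score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append(
            {
                # Term frequency times log inverse document frequency.
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return scores


# Hypothetical toy emails, already lowercased and tokenized.
emails = [
    ["verify", "your", "account", "now"],
    ["meeting", "notes", "for", "your", "team"],
    ["your", "account", "is", "suspended", "verify", "now"],
]
scores = tf_idf(emails)
```

A word such as "your", which appears in every toy email, receives a score of zero, while words unique to one email score highest. This is the property that lets the method surface phrases characteristic of one class.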

---

## About Our Data

Our source dataset was obtained from Zenodo, a reputable open-access research data platform. It was published by researchers Arifa I. Champa, Fazle Rabbi, and Minhaz F. Zibran in August 2024. The dataset is also referenced in multiple peer-reviewed academic papers, making it a trustworthy source for phishing research.

The Zenodo phishing validation dataset (2024) includes 11 datasets gathered from both historical corpora and curated sources. We used `Nazario.csv` and `Nazario_5.csv`.

- `Nazario.csv` is a well-known phishing corpus created by Jose Nazario in the mid-2000s using spam traps and public reports, containing phishing emails.
- `Nazario_5.csv` was curated by combining legitimate messages to create a balanced dataset for classification; it includes both phishing and non-phishing emails.

Our dataset is composed of 1551 phishing and 1497 safe emails.

## Limitations of our data

Many of our emails are from the early and mid-2000s. Although they are still widely applicable, phishing emails have evolved significantly since then.

The Nazario_5 dataset does not explicitly state where its non-phishing and more recent emails originate; they were likely curated from other sources.

Since our dataset is partially processed and curated from multiple sources, we cannot fully verify whether label assignment or content preservation was performed without error during cleaning, merging, or formatting.

Some emails are missing subject or body values entirely, which reduces the effectiveness of content-based analysis. In some cases, an email in the original dataset spans multiple lines within a single CSV cell and therefore had to be excluded from our dataset.

Our final dataset after the cleaning script contains 1551 phishing and 1497 non-phishing emails, an imbalance between the two that might affect our conclusions.
We aim to mitigate it as follows:

- For word/category analysis: use percentages within each class (phishing vs. safe), not raw counts.

- For machine learning: apply techniques like class weighting, oversampling (adding more safe emails), or undersampling (reducing phishing emails).

- For model evaluation: use metrics that account for imbalance, instead of just accuracy.
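
The first two mitigations can be sketched as follows. This is an illustration under stated assumptions: the helper names and the four-email sample are hypothetical, and the weighting formula `n_samples / (n_classes * class_count)` is the common "balanced" heuristic rather than a choice the team has committed to.

```python
from collections import Counter

# Class counts reported above.
class_counts = {"phishing": 1551, "safe": 1497}


def balanced_class_weights(counts: dict[str, int]) -> dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


def share_of_emails_with_word(
    emails: list[tuple[str, str]], word: str
) -> dict[str, float]:
    """Percentage of emails in each class whose text contains `word`."""
    totals = Counter(label for label, _ in emails)
    hits = Counter(label for label, text in emails if word in text.split())
    return {label: 100 * hits[label] / totals[label] for label in totals}


weights = balanced_class_weights(class_counts)

# Hypothetical mini-sample, only to show the per-class percentage calculation.
sample = [
    ("phishing", "verify your account"),
    ("phishing", "claim your prize"),
    ("safe", "meeting notes attached"),
    ("safe", "see you at lunch"),
]
verify_share = share_of_emails_with_word(sample, "verify")
```

With the counts above, the safe class receives a weight slightly above 1 and the phishing class slightly below 1, which reflects how small the imbalance is.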

## Limitations of Our Approach

While our method gives us valuable insights, we’re aware of some limitations:

- **Lack of Regional and Cultural Focus**: Our data is mostly global and not grouped by region or culture. Since writing styles vary across different places, this could limit how well our findings apply to specific populations.

- **English-Only Data**: All of our analysis is based on emails written in English. This means our results may not apply to phishing messages written in other languages.

- **Combining Different Datasets**: Our research brings together emails from different sources, which can sometimes lead to inconsistencies in how messages are labeled or formatted. This might affect the accuracy of our comparisons.

- **Language Is Complex**: While mathematical models are powerful, they don’t always capture the full meaning behind language—like tone, sarcasm, or cultural references—which are often key parts of communication and deception.

Despite these challenges, our goal is to use language-based patterns to better understand phishing techniques and support the development of tools that can detect these threats more effectively. In addition to the research above, we also aim to analyze features like punctuation marks, length, and grammatical error rates to investigate more possible patterns.
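
A rough sketch of such surface features is shown below. The function name `surface_features` and the chosen feature set are assumptions for illustration. Note that these features would need to be computed on the raw text, since `clean_data.py` lowercases the text and removes punctuation.

```python
import string


def surface_features(text: str) -> dict[str, float]:
    """Compute simple length, punctuation, and capitalization features."""
    words = text.split()
    n_chars = len(text)
    n_punct = sum(ch in string.punctuation for ch in text)
    return {
        "char_length": n_chars,
        "word_count": len(words),
        "exclamation_marks": text.count("!"),
        # Guard against empty strings to avoid division by zero.
        "punctuation_ratio": n_punct / n_chars if n_chars else 0.0,
        "uppercase_word_ratio": (
            sum(w.isupper() for w in words) / len(words) if words else 0.0
        ),
    }


# Hypothetical example message.
features = surface_features("URGENT! Verify your account now!!!")
```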

## Data Cleaning Scripts

## Objective

To create a fully reusable and reproducible Python script pipeline that:

1. Cleans and preprocesses raw email data.
2. Prepares the text for linguistic analysis.
3. Organizes it into training and validation sets for future modeling tasks.
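
The third objective could be approached with a stratified split, so that both sets keep roughly the same phishing-to-safe ratio. The sketch below is one possible approach using only the standard library; the function name, the 20% validation fraction, and the seed are assumptions, not settled project decisions.

```python
import random
from collections import defaultdict


def stratified_split(labels, validation_fraction=0.2, seed=42):
    """Return (train_indices, validation_indices) preserving class ratios."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, validation = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * validation_fraction)
        validation.extend(indices[:cut])
        train.extend(indices[cut:])
    return sorted(train), sorted(validation)


# Hypothetical labels: ten phishing emails followed by ten safe ones.
labels = ["phishing"] * 10 + ["safe"] * 10
train_idx, val_idx = stratified_split(labels)
```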

---

## Step-by-Step Plan

### 1. Input Requirements

- **Expected input format**: CSV (or structured `.eml` in future versions).
- **Must include at least two columns**:
  - **Email Text**: the full message body.
  - **Email Type**: the label indicating if it's a phishing or legitimate email.

### 2. Data Cleaning and Processing Pipeline

This script will follow a detailed pipeline inspired by Champa et al. (2024), adaptable for real-world, messy email data:

a. **Deduplication**
Remove repeated emails by comparing the content of **Email Text**.

b. **Discrepancy Removal**
Drop rows with missing, empty, or null text entries.
*Optionally* integrate language detection (e.g., `langdetect`) to retain only English emails.

c. **HTML & Text Normalization**
Strip HTML tags using **BeautifulSoup**. Retain only visible, meaningful plain text.

d. **Case Normalization and Punctuation Cleanup**
Convert all text to lowercase. Remove punctuation, symbols, and excessive whitespace using regex.

e. **Stop Word Removal**
Use **NLTK** or **spaCy** to filter out frequent stop words (e.g., “is”, “the”, “and”).

f. **(Optional) Tokenization**
If needed for later analysis or feature extraction, tokenize the cleaned text into word units.
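
Steps a, b, e, and f above can be sketched together as follows. This is a simplified illustration: the tiny `STOP_WORDS` set stands in for the much larger lists that NLTK or spaCy provide, the helper names are hypothetical, and duplicates are detected on the normalized text, which is an assumption about how "comparing the content" would be done.

```python
import re

# Small illustrative stop-word list; NLTK and spaCy ship much larger ones.
STOP_WORDS = {"is", "the", "and", "a", "to", "of", "your"}


def tokenize(text: str) -> list[str]:
    """Lowercase, then split into runs of letters, digits, or underscores."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def prepare(emails: list[str]) -> list[list[str]]:
    """Drop empty and duplicate emails, then tokenize and filter stop words."""
    seen = set()
    prepared = []
    for text in emails:
        normalized = " ".join(tokenize(text))
        if not normalized or normalized in seen:
            continue  # skip empty rows and repeated emails
        seen.add(normalized)
        prepared.append(remove_stop_words(normalized.split()))
    return prepared


# Hypothetical input: one duplicate (after normalization) and one empty row.
result = prepare(
    ["Verify your account!", "verify YOUR account", "", "The meeting is at noon."]
)
```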

---

*Once your merged dataset is finalized, simply run `clean_data.py` in this folder to produce `cleaned_phishing_dataset.csv` ready for modeling.*
74 changes: 74 additions & 0 deletions 2_data_preparation/clean_data.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
#!/usr/bin/env python3
"""
Author: Ahmad Hamed Dehzad
Date: 2025-06-30

Description:
Load, clean, and save the merged phishing dataset.
- Fills missing fields
- Parses dates
- Cleans HTML, punctuation, and whitespace from text
- Drops duplicates based on subject AND body
- Outputs cleaned_phishing_dataset.csv
"""

import os
import pandas as pd
import re
from bs4 import BeautifulSoup
import warnings
from bs4 import MarkupResemblesLocatorWarning

# Suppress BeautifulSoup locator warnings
warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)


def clean_text(text: str) -> str:
    """Remove HTML, punctuation, extra whitespace, and lowercase the text."""
    # strip HTML
    text = BeautifulSoup(str(text), "html.parser").get_text()
    # remove punctuation first, so the gaps it leaves are collapsed below
    text = re.sub(r"[^\w\s]", "", text)
    # collapse whitespace
    text = re.sub(r"\s+", " ", text)
    return text.lower().strip()


def main():
    # Determine the directory where this script lives
    script_dir = os.path.dirname(os.path.abspath(__file__))

    # Build full paths for input and output
    input_path = os.path.join(script_dir, "merged_phishing_dataset.csv")
    output_path = os.path.join(script_dir, "cleaned_phishing_dataset.csv")

    # 1. Load the dataset
    df = pd.read_csv(input_path)

    # 2. Handle missing values
    df["sender"] = df["sender"].fillna("missing")
    df["receiver"] = df["receiver"].fillna("missing")
    df["subject"] = df["subject"].fillna("missing")
    df["body"] = df["body"].fillna("")

    # 3. Parse dates (with UTC to avoid mixed-timezone issues)
    df["date"] = pd.to_datetime(df["date"], errors="coerce", utc=True)
    df["date"] = df["date"].dt.tz_localize(None).fillna(pd.Timestamp("1970-01-01"))

    # 4. Clean text columns
    df["subject"] = df["subject"].apply(clean_text)
    df["body"] = df["body"].apply(clean_text)
    df["urls"] = df["urls"].astype(str).apply(clean_text)

    # 5. Drop duplicates where both cleaned subject and body match exactly
    df = df.drop_duplicates(subset=["subject", "body"], keep="first").reset_index(
        drop=True
    )

    # 6. Save cleaned dataset
    df.to_csv(output_path, index=False)
    print(f"Done! Saved cleaned dataset to:\n {output_path}")


if __name__ == "__main__":
    main()
3,504 changes: 3,504 additions & 0 deletions 2_data_preparation/cleaned_phishing_dataset.csv

Large diffs are not rendered by default.

11 changes: 11 additions & 0 deletions 3_data_exploration/README.md
Original file line number Diff line number Diff line change
@@ -1 +1,12 @@
# Data Exploration

Our dataset contains 1551 phishing and 1497 safe emails. Below we visually summarize the class distribution of phishing and safe emails.

![Class Distribution](class_distribution.png)

Another important aspect we will examine to better understand our data is word frequency.

![Frequent Words](top_words.png)

The presence of URLs is another aspect we aim to analyze in phishing vs. non-phishing emails.

![URL_Existence](url_proportion.png)
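
As a hedged sketch of how the URL proportion per class might be computed, the snippet below searches each body for a simple URL pattern. The regular expression, the function name, and the sample emails are illustrative assumptions. It would need to run on raw bodies (or use the dataset's `urls` column), because the cleaning script strips the punctuation that URLs depend on.

```python
import re
from collections import Counter

# Deliberately simple pattern; real-world URL matching is more involved.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def url_proportion_by_class(emails: list[tuple[str, str]]) -> dict[str, float]:
    """Fraction of emails in each class whose body contains a URL."""
    totals = Counter()
    with_url = Counter()
    for label, body in emails:
        totals[label] += 1
        if URL_PATTERN.search(body):
            with_url[label] += 1
    return {label: with_url[label] / totals[label] for label in totals}


# Hypothetical raw (uncleaned) emails.
sample = [
    ("phishing", "Click http://example.com/login now"),
    ("phishing", "Confirm at WWW.example.net today"),
    ("safe", "Agenda attached"),
    ("safe", "See https://example.org/docs for details"),
]
proportions = url_proportion_by_class(sample)
```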
Binary file added 3_data_exploration/class_distribution.png
Binary file added 3_data_exploration/top_words.png
Binary file added 3_data_exploration/url_proportion.png
34 changes: 34 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
# 🐼 The Pandas Pact

Welcome to the official repository of **The Pandas Pact** – a cross-cultural, collaborative team committed to learning, growing, and delivering impactful work together. This repository houses all project materials, documentation, meeting notes, and resources shared across our team.

---

## 🌍 About Us

**The Pandas Pact** is a diverse team of global learners working together with respect, integrity, and creativity. Our mission is to foster a safe, inclusive, and productive environment where every member contributes meaningfully while embracing cultural differences.

---

## ✅ Group Norms

Our norms guide how we work together. We believe in:

- Respecting each other's time and communication styles
- Open, constructive feedback
- Clear expectations and accountability
- Empathy, support, and trust in each other
- Regular reflection and adaptation of our practices

📄 You can read our [full group norms](/collaboration/README.md).

---

## 💬 Communication

We coordinate and communicate using:

- **Slack Channel**: `#et6_cdsp_group_14`
- **Meetings**: Scheduled weekly or as needed via Zoom or Google Meet (details shared in Slack)
- **Documentation**: Shared and updated here in the repo
- **Meeting Notes**: Action items and summaries are posted in Slack and optionally stored in this repo