NLP-spam-filter

A learning project in Python that classifies text messages as spam or not spam using a Linear SVC (Support Vector Classifier). The model is trained on the SMS Spam Collection Data Set from the UCI Machine Learning Repository.

Usage

Packages

The following packages are required. I would strongly recommend installing them with the pip3 package manager.

nltk
pandas
scipy
scikit-learn (imported as sklearn)

Run

To install and run the program, enter the following terminal commands:

git clone https://www.github.com/jessicalally/NLP-spam-filter.git
cd NLP-spam-filter
python3 spam_classifier.py

The program will then prompt the user to input a message to be classified.

Examples

"Call 01234 567890 NOW for a free pizza"  // This message is spam
"Hi! How are you?"                        // This message is not spam

Analysis

Below are some metrics that display the performance of the model and some of the key terms it identifies in spam messages.

Confusion Matrix

Out of the 1115 examples used for testing, the model classified 1094 messages correctly, an accuracy of 1094/1115 ≈ 0.981. The F1 score of the smaller class in the matrix below (149 test messages) is 258/279 ≈ 0.925. Performance could likely be improved by training on a larger dataset and by more comprehensive pre-processing of the data.

                predicted
                spam    ham
actual  spam    965     1
actual  ham     20      129
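
As a cross-check, the headline metrics can be recomputed from the four counts in the matrix above. This is a small sketch that uses only the numbers shown; the row and column labels are taken exactly as printed, and the helper name precision_recall_f1 is illustrative rather than part of the project.

```python
# Counts copied from the confusion matrix above, keyed as (actual, predicted).
counts = {
    ("spam", "spam"): 965, ("spam", "ham"): 1,
    ("ham", "spam"): 20, ("ham", "ham"): 129,
}
labels = ("spam", "ham")

total = sum(counts.values())                    # all test messages
correct = sum(counts[(c, c)] for c in labels)   # diagonal of the matrix
accuracy = correct / total


def precision_recall_f1(positive):
    """Precision, recall and F1 when `positive` is treated as the positive class."""
    tp = counts[(positive, positive)]
    fp = sum(counts[(a, positive)] for a in labels if a != positive)
    fn = sum(counts[(positive, p)] for p in labels if p != positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


print(total, correct, round(accuracy, 3))       # 1115 1094 0.981
for label in labels:
    p, r, f1 = precision_recall_f1(label)
    print(label, round(p, 3), round(r, 3), round(f1, 3))
```

Accuracy alone can look flattering when one class dominates the test set, which is why the per-class precision, recall and F1 are worth reading alongside it.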

Top Predictors

The following are the 20 top predictors of spam, according to the model.

Each term is the stemmed form of a real word or phrase; stemming groups variants of the same word together, which helps the model generalise. The model assigns each term a weight that indicates how strongly that term contributes to classifying a message as spam.

The most influential term is phonenumb (a placeholder that replaces any numerical phone number), meaning the model has identified the presence of a phone number as a key factor in detecting spam. Another interesting term is currencysymbolnumbersymbol, which stands for an amount of currency, indicating that spam messages frequently ask for or refer to money.
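
The placeholder terms described above suggest that phone numbers, currency symbols, numbers and URLs are replaced with fixed tokens before stemming. A minimal sketch of that kind of normalisation follows, using only the standard library. The regular expressions, the function name normalise and the unstemmed token names (phonenumber, numbersymbol, currencysymbol, urladdress) are assumptions inferred from the terms in this README, not the project's actual code. Stemming (for example with nltk) would be a separate, later step.

```python
import re

# Hypothetical patterns: the project's real regular expressions are not shown in this README.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PHONE_RE = re.compile(r"\b\d[\d\s-]{8,}\d\b")   # 10+ digits, allowing spaces and dashes
CURRENCY_RE = re.compile(r"[£$€]")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def normalise(message: str) -> str:
    """Lower-case a message and replace URLs, phone numbers, currency
    symbols and numbers with placeholder tokens (assumed token names)."""
    text = message.lower()
    # Order matters: URLs and phone numbers must be handled before bare numbers.
    text = URL_RE.sub("urladdress", text)
    text = PHONE_RE.sub("phonenumber", text)
    text = CURRENCY_RE.sub("currencysymbol", text)
    text = NUMBER_RE.sub("numbersymbol", text)
    # Drop remaining punctuation and collapse runs of whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())


print(normalise("Call 01234 567890 NOW for a free pizza"))
# call phonenumber now for a free pizza
print(normalise("Win £100 now!"))
# win currencysymbolnumbersymbol now
```

Substituting without surrounding spaces is what fuses an amount such as £100 into a single currencysymbolnumbersymbol token, which is the form that appears among the top predictors.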

Term	Weight
phonenumb	5.099450
txt	2.842654
call phonenumb	2.378848
numbersymbol	2.220229
currencysymbolnumbersymbol	2.005241
free	1.821944
claim	1.810054
repli	1.804426
rington	1.801897
servic	1.783487
mobil	1.733426
text	1.620361
uk	1.562161
stop	1.527382
tone	1.494014
currencysymbolurladdress	1.436521
prize	1.284281
credit	1.259734
order	1.103612
poli	1.087950
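
To illustrate how such weights are used: a linear classifier scores a message by summing weight times feature value over its terms, adding an intercept, and checking the sign of the result. The sketch below uses a few weights from the table above. The intercept, the feature values, the convention that a positive score means spam, and the function names are hypothetical, since this README reports neither the fitted intercept nor the exact feature representation.

```python
from typing import Mapping

# A subset of the weights listed in the table above.
WEIGHTS = {
    "phonenumb": 5.099450,
    "txt": 2.842654,
    "call phonenumb": 2.378848,
    "free": 1.821944,
    "claim": 1.810054,
}

# Hypothetical value: the model's fitted intercept is not given in this README.
INTERCEPT = -1.0


def decision_score(features: Mapping[str, float]) -> float:
    """Linear decision function: intercept + sum(weight * feature value).

    Terms missing from WEIGHTS contribute 0 here, which is a simplification;
    a fitted model has a weight (possibly negative) for every vocabulary term.
    """
    return INTERCEPT + sum(
        WEIGHTS.get(term, 0.0) * value for term, value in features.items()
    )


def is_spam(features: Mapping[str, float]) -> bool:
    """Assumed convention: a positive score is classified as spam."""
    return decision_score(features) > 0


# Hypothetical feature values (1.0 = term present) for two messages.
spam_like = {"call phonenumb": 1.0, "phonenumb": 1.0, "free": 1.0}
ham_like = {"hi": 1.0, "how": 1.0}

print(is_spam(spam_like))  # True
print(is_spam(ham_like))   # False
```

Because the score is a plain weighted sum, a single high-weight term such as phonenumb can outweigh several weaker ones, which is why it dominates the list.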

References

Using natural language processing to build a spam filter for text messages - Red Huq, In Machines We Trust

Spam Filtering Emails: An Approach with Natural Language Processing - Kasumi Gunasekara, Medium @KasumiGunasekara
