jessicalally/NLP-spam-filter

NLP-spam-filter

Learning project in Python for classifying text messages as spam or not spam using a Linear SVC (Support Vector Classifier). The model trains on the SMS Spam Collection Data Set from the UCI Machine Learning repository.

Usage

Packages

The following packages are required. I would strongly recommend installing them with the pip3 package manager, for example: pip3 install nltk pandas scipy scikit-learn

  • nltk
  • pandas
  • scipy
  • scikit-learn (imported as sklearn)

Run

To install and run the program, enter the following terminal commands:

git clone https://www.github.com/jessicalally/NLP-spam-filter.git
cd NLP-spam-filter
python3 spam_classifier.py

The program will then prompt the user to input a message to be classified.

Examples

"Call 01234 567890 NOW for a free pizza"  // This message is spam
"Hi! How are you?"                        // This message is not spam

Analysis

Below are some metrics that display the performance of the model and some of the key terms it identifies in spam messages.

Confusion Matrix

Out of the 1115 examples used for testing, the model classified 1094 messages correctly, an accuracy of 0.981 (1094/1115). Performance could likely be improved by training on a larger dataset and by pre-processing the data more thoroughly.

                 predicted
                 spam    ham
actual   spam     965      1
         ham       20    129
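The accuracy figure can be recomputed directly from the four counts in the matrix. The following is a minimal sketch in plain Python; the dictionary layout and the helper names precision and recall are illustrative choices, and the row and column labels are taken exactly as printed above.

```python
# Counts copied from the confusion matrix above; keys are (actual, predicted).
matrix = {
    ("spam", "spam"): 965,
    ("spam", "ham"): 1,
    ("ham", "spam"): 20,
    ("ham", "ham"): 129,
}

total = sum(matrix.values())
correct = sum(n for (actual, predicted), n in matrix.items() if actual == predicted)
accuracy = correct / total


def precision(label):
    """Share of messages predicted as `label` that really are `label`."""
    predicted_as_label = sum(n for (_, p), n in matrix.items() if p == label)
    return matrix[(label, label)] / predicted_as_label


def recall(label):
    """Share of messages that really are `label` and were predicted as such."""
    actually_label = sum(n for (a, _), n in matrix.items() if a == label)
    return matrix[(label, label)] / actually_label


print(f"accuracy = {correct}/{total} = {accuracy:.3f}")  # accuracy = 1094/1115 = 0.981
for label in ("spam", "ham"):
    print(f"{label}: precision = {precision(label):.3f}, recall = {recall(label):.3f}")
```

Because the two classes are very different in size, per-class precision and recall say more about the model than accuracy alone.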

Top Predictors

The following are the 20 top predictors of spam, according to the model.

Each term is a stemmed form of a real word or phrase; stemming groups variants of the same word together, which helps produce a more accurate model. The model assigns each term a weight that indicates how important that term is in deciding whether or not a message is spam.

The most influential term is phonenumb (a symbolised stand-in for a numerical phone number), meaning the model has identified phone numbers as a key factor in detecting spam. Another interesting term is currencysymbolnumbersymbol, which stands for an amount of currency, indicating that spam messages frequently ask for or refer to money.
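The symbolised terms suggest that raw messages are normalised before training. The exact pre-processing in spam_classifier.py is not reproduced here; the following is a minimal sketch, assuming hypothetical regex patterns and the placeholder names phonenumber, currencysymbol, numbersymbol and urladdress, which are inferred from the terms in the table below.

```python
import re

# Hypothetical patterns, chosen for illustration; the project's own rules may differ.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PHONE_PATTERN = re.compile(r"\b\d{4,5}[\s-]?\d{6}\b")  # e.g. 01234 567890
CURRENCY_PATTERN = re.compile(r"[£$€]")
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def normalise(message):
    """Replace URLs, phone numbers, currency symbols and numbers with placeholders."""
    text = URL_PATTERN.sub("urladdress", message)
    text = PHONE_PATTERN.sub("phonenumber", text)
    # No padding is added, so "£100" becomes "currencysymbolnumbersymbol".
    text = CURRENCY_PATTERN.sub("currencysymbol", text)
    text = NUMBER_PATTERN.sub("numbersymbol", text)
    # Drop punctuation, collapse whitespace and lower-case the result.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


print(normalise("Call 01234 567890 NOW for a free pizza"))
# call phonenumber now for a free pizza
print(normalise("Win £100 today!"))
# win currencysymbolnumbersymbol today
```

Stemming (not shown) would then shorten some tokens, which is presumably how phonenumber ends up as phonenumb.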

Term                          Weight
phonenumb                     5.099450
txt                           2.842654
call phonenumb                2.378848
numbersymbol                  2.220229
currencysymbolnumbersymbol    2.005241
free                          1.821944
claim                         1.810054
repli                         1.804426
rington                       1.801897
servic                        1.783487
mobil                         1.733426
text                          1.620361
uk                            1.562161
stop                          1.527382
tone                          1.494014
currencysymbolurladdress      1.436521
prize                         1.284281
credit                        1.259734
order                         1.103612
poli                          1.087950

References

Using natural language processing to build a spam filter for text messages - Red Huq, In Machines We Trust

Spam Filtering Emails: An Approach with Natural Language Processing - Kasumi Gunasekara, Medium @KasumiGunasekara
