Learning project in Python for classifying text messages as spam or not spam using a Linear SVC (Support Vector Classifier). The model trains on the SMS Spam Collection Data Set from the UCI Machine Learning repository.
The following packages are required. I would strongly recommend using the pip3
package manager.
nltk
pandas
scipy
scikit-learn (imported as sklearn)
To install and run the program, enter the following terminal commands:
git clone https://www.github.com/jessicalally/NLP-spam-filter.git
cd NLP-spam-filter
python3 spam_classifier.py
The program will then prompt the user to input a message to be classified.
"Call 01234 567890 NOW for a free pizza" // This message is spam
"Hi! How are you?" // This message is not spam
Below are some metrics that display the performance of the model and some of the key terms it identifies in spam messages.
Out of the 1115 examples used for testing, the model classified 1094 messages correctly, an accuracy of 0.981 (1094/1115). Performance could likely be improved by training on a larger dataset and by more comprehensive pre-processing of the data.
Actual class | Predicted spam | Predicted ham |
---|---|---|
spam | 965 | 1 |
ham | 20 | 129 |
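The summary figures can be re-derived from the confusion matrix above. The snippet below is a minimal sketch in plain Python; the variable names are illustrative and not taken from spam_classifier.py, and treating spam as the positive class is an assumption.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
# Treating "spam" as the positive class is an assumption made for this sketch.
true_spam = 965    # actual spam, predicted spam
missed_spam = 1    # actual spam, predicted ham
false_alarm = 20   # actual ham, predicted spam
true_ham = 129     # actual ham, predicted ham

total = true_spam + missed_spam + false_alarm + true_ham
accuracy = (true_spam + true_ham) / total
# Of the messages flagged as spam, the fraction that really were spam.
precision = true_spam / (true_spam + false_alarm)
# Of the messages that really were spam, the fraction that were flagged.
recall = true_spam / (true_spam + missed_spam)

print(f"total={total} accuracy={accuracy:.3f} "
      f"precision={precision:.3f} recall={recall:.3f}")
# prints: total=1115 accuracy=0.981 precision=0.980 recall=0.999
```

Precision and recall are worth reporting alongside accuracy here because the two classes are far from balanced in size.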
The following are the 20 top predictors of spam, according to the model.
Each term is a stemmed form of a real word or phrase; stemming groups variants of the same word together, which is intended to make the model more accurate. The model assigns each term a weight that indicates how important that term is in deciding whether or not a message is spam.
The most influential term is phonenumb, the stemmed placeholder that replaces a numerical phone number, meaning the model has identified phone numbers as a key factor in detecting spam. Another interesting term is currencysymbolnumbersymbol, which stands for an amount of currency and indicates that spam messages frequently ask for or refer to money.
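To illustrate how placeholder terms like these might arise, here is a rough sketch of a normalisation step using only the standard library. The placeholder words mirror the terms listed in this README, but the regular expressions and the `normalise` function are assumptions for illustration and are not the project's actual pre-processing rules. Stemming is left out to keep the sketch dependency-free.

```python
import re

# Placeholder words mirror the terms listed in this README; the patterns
# themselves are illustrative assumptions, not the project's actual rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
PHONE_RE = re.compile(r"\b\d[\d ]{8,}\d\b")  # crude: 10+ digits, spaces allowed
CURRENCY_RE = re.compile(r"[£$€]")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def normalise(message: str) -> str:
    """Replace URLs, phone numbers, currency symbols and numbers with words."""
    text = message.lower()
    # URLs go first so that digits inside them are not rewritten separately.
    text = URL_RE.sub("urladdress", text)
    # Phone numbers go before generic numbers for the same reason.
    text = PHONE_RE.sub("phonenumber", text)
    text = CURRENCY_RE.sub("currencysymbol", text)
    text = NUMBER_RE.sub("numbersymbol", text)
    return text


print(normalise("Call 01234 567890 NOW for a free pizza"))
# prints: call phonenumber now for a free pizza
print(normalise("Win £100 today"))
# prints: win currencysymbolnumbersymbol today
```

Replacing concrete values with a shared placeholder lets the model learn from the presence of a phone number or a price rather than from each individual number.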
Term | Weight |
---|---|
phonenumb | 5.099450 |
txt | 2.842654 |
call phonenumb | 2.378848 |
numbersymbol | 2.220229 |
currencysymbolnumbersymbol | 2.005241 |
free | 1.821944 |
claim | 1.810054 |
repli | 1.804426 |
rington | 1.801897 |
servic | 1.783487 |
mobil | 1.733426 |
text | 1.620361 |
uk | 1.562161 |
stop | 1.527382 |
tone | 1.494014 |
currencysymbolurladdress | 1.436521 |
prize | 1.284281 |
credit | 1.259734 |
order | 1.103612 |
poli | 1.087950 |
Using natural language processing to build a spam filter for text messages - Red Huq, In Machines We Trust
Spam Filtering Emails: An Approach with Natural Language Processing - Kasumi Gunasekara, Medium @KasumiGunasekara