Let's create a spam classifier using Naive Bayes and a Tf-IDF vectorizer.
Loading the Dataset
The dataset can be downloaded from Kaggle, where many labelled datasets of spam and ham messages of this kind are available.
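First, load the CSV with pandas. This is a minimal loading sketch; the filename "spam.csv" is an assumption, so adjust it (and the encoding, if needed) to match the file you downloaded.

import pandas as pd

# filename is an assumption; use the name of the CSV downloaded from Kaggle
# add encoding="latin-1" if pandas complains about the file encoding
df = pd.read_csv("spam.csv")
print(df.shape)   # expect 5559 rows with columns such as 'type' and 'text'
print(df.head())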
df['type'].value_counts()

ham     4812
spam     747
Name: type, dtype: int64
As the counts above show, this particular dataset contains 5559 rows: 747 spam and 4812 ham messages. One thing is still missing, though: the model needs the labels as numbers, so we can add another column to the dataset containing 0 for ham and 1 for spam.
# assigning labels to the dataset, 0 for ham and 1 for spam
df['label_num'] = df['type'].apply(lambda x: 1 if x == 'spam' else 0)
Running this code creates a new column, label_num, at the end of the dataframe with the numeric label for each message.
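A quick way to confirm the new column, using the column names this dataset already has:

# peek at a few rows to verify the label_num column was added
print(df[['text', 'type', 'label_num']].head())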
Splitting the Dataset
Usually, we split the dataset into a train set and a test set. The model is trained on the training data and its corresponding labels, and then evaluated on the test data to check whether it is good at predicting messages it has not seen before.
# splitting the dataset
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label_num'], test_size=0.3, random_state=42)
Rows in the train and test sets:

print("train set:", X_train.shape)   # rows in train set
print("test set:", X_test.shape)     # rows in test set
Tf-IDF Vectorization
Tf-IDF (term frequency-inverse document frequency) is an efficient statistical method for weighting how relevant each word is within a text, sentence, or paragraph. Our messages are raw text, and a model cannot work with text directly, so Tf-IDF converts each message into a vector of numerical weights that can be fed to the model.
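To get a feel for what the vectorizer produces, here is a tiny illustrative sketch on two made-up sentences (they are not part of the dataset):

from sklearn.feature_extraction.text import TfidfVectorizer

toy = ["free prize waiting claim now", "are we still meeting for lunch"]
toy_vec = TfidfVectorizer()
toy_matrix = toy_vec.fit_transform(toy)

# one row per sentence, one column per word, each value a Tf-IDF weight
print(toy_vec.get_feature_names_out())   # use get_feature_names() on older scikit-learn versions
print(toy_matrix.toarray().round(2))

Words that occur in fewer documents receive higher weights than words spread across many documents, which is exactly the relevance signal the classifier will learn from.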
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Applying Tf-IDF vectorization: learn the vocabulary and weights from the train set,
# then reuse the same vocabulary to transform the test set
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
train_transformed = vectorizer.fit_transform(X_train)
test_transformed = vectorizer.transform(X_test)

# Fit the transformed train data to the model
model = MultinomialNB()
model.fit(train_transformed, y_train)
Here we applied the Tf-IDF vectorizer and fitted the transformed train data to the Multinomial Naive Bayes classifier. Now let's look at the predictions made by the model.
Run this code to see the predictions and compare them with actual values.
prediction = model.predict(test_transformed)
actual = y_test

print("Prediction:", list(prediction))
print("Actual:", list(actual))
Evaluating the Model
Confusion matrix:
from sklearn.metrics import confusion_matrix

# note the argument order: with (prediction, actual) the rows of the matrix
# are the predicted labels and the columns are the actual labels
matrix = confusion_matrix(prediction, actual)
matrix

array([[1457,   52],
       [   0,  159]], dtype=int64)
# with this layout: matrix[1][1] = true positives (spam predicted as spam),
# matrix[1][0] = false positives (ham predicted as spam),
# matrix[0][1] = false negatives (spam predicted as ham)
precision = matrix[1][1] / (matrix[1][1] + matrix[1][0])
recall = matrix[1][1] / (matrix[1][1] + matrix[0][1])
f1score = matrix[1][1] / (matrix[1][1] + (matrix[1][0] + matrix[0][1]) / 2)

print("precision score:", precision)
print("recall score:", recall)
print("f1_score:", f1score)

precision score: 1.0
recall score: 0.7535545023696683
f1_score: 0.8594594594594595
The scores are not bad at all. The precision is 1.0, which means every message the model flagged as spam really was spam: it never misclassified a ham message. The recall (sensitivity) is about 0.75, so some spam messages still slip through as ham; those are the 52 false negatives visible in the confusion matrix. The f1 score balances the two and also looks reasonable.
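The same scores can also be computed directly with sklearn's metric functions, which is a handy cross-check for the manual formulas above (the exact numbers depend on your train/test split):

from sklearn.metrics import precision_score, recall_score, f1_score

print("precision score:", precision_score(actual, prediction))
print("recall score:", recall_score(actual, prediction))
print("f1_score:", f1_score(actual, prediction))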
Let's predict some real messages. Here are some messages that I received in the past.
Congragulations! You have won a $10,000. Go to https://bit.ly/23343 to claim now.
Get $10 Amazon Gift Voucher on Completing the Demo:- va.pcb3.in/ click this link to claim now
You have won a $500. Please register your account today itself to claim this offer https://imp.com.
Please dont respond to missed calls from unknown international numbers Call/ SMS on winning prize. lottery as this may be fraudulent call.
messages = ["Congragulations! You have won a $10,000. Go to https://bit.ly/23343 to claim now.","Get $10 Amazon Gift Voucher on Completing the Demo:- va.pcb3.in/ click this link to claim now","You have won a $500. Please register your account today itself to claim now https://imp.com","Please dont respond to missed calls from unknown international numbers Call/ SMS on winning prize. lottery as this may be fraudulent call."]message_transformed = vectorizer.transform(messages)new_prediction = model.predict(message_transformed)for i in range(len(new_prediction)):if new_prediction[i] == 0:print("Ham.")else:print("Spam.")Spam.Spam.Spam.Ham.
The first three messages I received were spam and the last one is ham (a genuine warning). The model classifies all four correctly when I run this code. However, this model is not ready for a real-life application; it is only a basic introduction to spam classification. You can improve it by tuning the hyperparameters, for example with a grid search, and by training on more and better datasets available on platforms like Kaggle.
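As a starting point for that kind of tuning, here is a sketch that wraps the vectorizer and the classifier in a pipeline and searches over a few settings with GridSearchCV; the parameter grid below is only an illustration, not the definitive choice.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("nb", MultinomialNB()),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],   # single words vs. single words + word pairs
    "nb__alpha": [0.1, 0.5, 1.0],             # Naive Bayes smoothing strength
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best cross-validated f1:", search.best_score_)

The pipeline refits the vectorizer inside every cross-validation fold, which avoids leaking information from the validation portion into the learned vocabulary.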