In [1]:
import pandas as pd

# Step 1: Load the dataset
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['label', 'message'])

# Optional: check dataset
print(df.head())
print(df['label'].value_counts())
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
label
ham     4825
spam     747
Name: count, dtype: int64
In [2]:
# Step 2: Encode the labels
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['label_encoded'] = le.fit_transform(df['label'])  # ham=0, spam=1
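The `ham=0, spam=1` mapping in the comment is not arbitrary: `LabelEncoder` assigns codes in sorted order of the class names, so `'ham'` comes before `'spam'`. A minimal sketch on toy labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(['ham', 'spam', 'ham'])

print(list(le.classes_))  # classes sorted alphabetically: ['ham', 'spam']
print(list(encoded))      # → [0, 1, 0]
```

The `classes_` attribute is also what lets `le.inverse_transform` map predictions back to string labels later.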
In [3]:
# Step 3: Train-test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label_encoded'], test_size=0.2, random_state=42)
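The `value_counts` output above shows the dataset is imbalanced (747 spam out of 5,572 messages, roughly 13%). An optional variation is to pass `stratify` so both splits keep that class ratio; a sketch with synthetic labels mimicking the imbalance:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [1] * 13 + [0] * 87  # ~13% positives, like the spam share in this dataset

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Stratification preserves the class ratio in the 20-item test split,
# so it contains 2 or 3 positives (13% of 20 = 2.6).
print(sum(y_te))
```

For a dataset this size the plain random split used above works fine too; stratification matters most for small or highly skewed data.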

Vectorization

Machine learning algorithms cannot work with raw text directly, so the text must first be converted into a numerical representation. One common approach is Count Vectorization.

CountVectorizer converts a collection of text documents into a matrix of token counts. Each row represents a document (message), and each column represents a unique word from the corpus. The cell values indicate how many times each word appears in each document.

For example, for 3 messages:

['I love spam', 'Spam is bad', 'I love ham']

The vectorized form (simplified):

Word    I   love  spam  is  bad  ham
Doc 1   1   1     1     0   0    0
Doc 2   0   0     1     1   1    0
Doc 3   1   1     0     0   0    1

This matrix is used as input to the machine learning model.

In [7]:
# Step 4: Convert text to numeric features
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
In [8]:
# Step 5: Train a Naïve Bayes classifier
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_vec, y_train)
Out[8]:
MultinomialNB()
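Under the hood, `MultinomialNB` scores each class by its prior probability times the product of smoothed per-class word probabilities; `alpha=1.0` (the default) is Laplace smoothing, which prevents unseen words from zeroing out a class. A minimal sketch on made-up count features:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: 2 "words", 3 documents
X = np.array([[2, 0],
              [1, 1],
              [0, 2]])
y = np.array([0, 0, 1])

clf = MultinomialNB(alpha=1.0)  # alpha=1.0 is the default (Laplace smoothing)
clf.fit(X, y)

# A document dominated by word 0, which mostly occurs in class 0
print(clf.predict(np.array([[3, 0]])))  # → [0]
```

This is why count features pair naturally with multinomial Naïve Bayes: the model's likelihood is defined directly over token counts.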
In [9]:
# Step 6: Evaluate the model
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test_vec)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred, target_names=le.classes_))
Accuracy: 0.9919

Classification Report:

              precision    recall  f1-score   support

         ham       0.99      1.00      1.00       966
        spam       1.00      0.94      0.97       149

    accuracy                           0.99      1115
   macro avg       1.00      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115
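To put the trained pipeline to use, new messages must be transformed with the same fitted vectorizer (never re-fit on new data) before calling `predict`. A self-contained sketch with a made-up miniature corpus standing in for the real one (the messages and labels here are illustrative, not from the SMS dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training corpus: 1 = spam, 0 = ham
train_msgs = ["win a free prize now", "free entry win cash",
              "see you at lunch", "call me when home"]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_msgs), train_labels)

# transform (not fit_transform) reuses the fitted vocabulary
print(clf.predict(vec.transform(["win a free cash prize"])))  # → [1]
print(clf.predict(vec.transform(["lunch at home"])))          # → [0]
```

With the real model, the same pattern applies: `model.predict(vectorizer.transform(new_messages))`, with `le.inverse_transform` turning the 0/1 predictions back into `'ham'`/`'spam'` labels.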