In [1]:
import pandas as pd
# Step 1: Load the dataset
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['label', 'message'])
# Optional: check dataset
print(df.head())
print(df['label'].value_counts())
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
label
ham     4825
spam     747
Name: count, dtype: int64
In [2]:
# Step 2: Encode the labels
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['label_encoded'] = le.fit_transform(df['label']) # ham=0, spam=1
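A quick standalone check of the `ham=0, spam=1` comment: `LabelEncoder` assigns integer codes in sorted (alphabetical) order of the unique labels, so `'ham'` always maps to 0 and `'spam'` to 1 regardless of row order. A minimal sketch:

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts the unique labels alphabetically before assigning
# codes, so 'ham' -> 0 and 'spam' -> 1 no matter the input order.
le_demo = LabelEncoder()
codes = le_demo.fit_transform(['spam', 'ham', 'ham', 'spam'])
print(list(le_demo.classes_))  # ['ham', 'spam']
print(list(codes))             # [1, 0, 0, 1]
```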
In [3]:
# Step 3: Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label_encoded'], test_size=0.2, random_state=42)
Vectorization
In machine learning, algorithms cannot work directly with raw text. Therefore, text must be converted into numerical representations. One common approach is Count Vectorization.
CountVectorizer converts a collection of text documents into a matrix of token counts. Each row represents a document (message), and each column represents a unique word from the corpus. The cell values indicate how many times each word appears in each document.
For example, for 3 messages:
['I love spam', 'Spam is bad', 'I love ham']
The vectorized form (simplified):
| | I | love | spam | is | bad | ham |
|---|---|---|---|---|---|---|
| Doc 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Doc 2 | 0 | 0 | 1 | 1 | 1 | 0 |
| Doc 3 | 1 | 1 | 0 | 0 | 0 | 1 |
This matrix is used as input to the machine learning model.
In [7]:
# Step 4: Convert text to numeric features
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
In [8]:
# Step 5: Train a Naïve Bayes classifier
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_vec, y_train)
Out[8]:
MultinomialNB()
In [9]:
# Step 6: Evaluate the model
from sklearn.metrics import accuracy_score, classification_report
y_pred = model.predict(X_test_vec)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred, target_names=le.classes_))
Accuracy: 0.9919
Classification Report:
              precision    recall  f1-score   support

         ham       0.99      1.00      1.00       966
        spam       1.00      0.94      0.97       149

    accuracy                           0.99      1115
   macro avg       1.00      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115
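Once trained, the same two objects classify unseen messages: transform the new text with the *already fitted* vectorizer, then call `predict`. The sketch below is self-contained (it fits a tiny toy corpus, since the SMS dataset is not bundled here); in the notebook you would reuse the fitted `vectorizer` and `model` from the steps above instead of refitting:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the fitted objects from the steps above; in the
# notebook, reuse `vectorizer` and `model` directly.
train_msgs = ['win a free prize now', 'free cash win win',
              'see you at lunch', 'are we still on for tonight']
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_msgs), train_labels)

# New messages must go through transform() (not fit_transform), so they
# are mapped onto the vocabulary learned from the training data.
new_msgs = ['win a free cash prize', 'lunch tonight?']
print(clf.predict(vec.transform(new_msgs)))  # [1 0] -> spam, ham
```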