Text Classification using Naïve Bayes, Logistic Regression, and SVM¶
In this notebook, we build text classifiers using three common ML algorithms:
- Naïve Bayes
- Logistic Regression
- Support Vector Machine (SVM)
We use a larger, balanced dataset and evaluate the models using classification metrics.
What is Vectorization?¶
Vectorization is the process of turning words or sentences into numbers so that a computer can work with them. Computers cannot read text the way humans do; they need everything as numbers. So, in tasks like sentiment analysis, spam detection, or translation, we must first convert text into a numeric format. This conversion is called vectorization.
Why Do We Need It? Imagine we have sentences like:
- I love this movie
- This movie is bad
A person can understand these easily. But for a computer, we have to represent them in a way it can use to learn. Vectorization helps us do that. Each sentence will be turned into a list of numbers (called a vector), and that vector will represent the meaning or pattern of the sentence.
Common Methods of Vectorization:¶
- Bag of Words (BoW): This is one of the simplest methods. It just counts how many times each word appears in a sentence. For example, if our vocabulary is [love, movie, bad], then:
- "I love this movie" becomes [1, 1, 0] (love and movie appear once, bad does not appear)
- "This movie is bad" becomes [0, 1, 1] (movie and bad appear once, love does not appear)
- TF-IDF (Term Frequency-Inverse Document Frequency): This method improves on Bag of Words. It still counts words, but it reduces the importance of very common words (like "is", "the") and increases the importance of rare, meaningful words (like "excellent", "terrible").
Why It's Called a Vector: Once a sentence is turned into a list of numbers, it becomes a vector, a term from mathematics meaning an ordered list of numbers that can be treated as a point (or direction) in space. In machine learning, each sentence is treated as a vector in a large space of possible meanings.
Drawbacks of Basic Vectorization:¶
- No understanding of word meaning: Words like "good" and "great" are treated as completely unrelated, even though they mean nearly the same thing.
- No sense of word order: Sentences like "dog bites man" and "man bites dog" have the same representation, even though they mean very different things.
- Large and sparse: For a large vocabulary, the vectors become very long with many zeros, which makes them inefficient.
- Cannot handle unknown words: If a word shows up that wasn't seen in the training data, the vectorizer has no column for it and simply ignores it.
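The word-order and unknown-word drawbacks are easy to demonstrate with `CountVectorizer`, using the "dog bites man" example from the list above:

```python
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer()
X = bow.fit_transform(["dog bites man", "man bites dog"])

# Word order is lost: both sentences map to identical count vectors
print(X.toarray())

# Unknown words are silently ignored: "cat" was never seen during fit,
# so only "bites" and "man" are counted
print(bow.transform(["cat bites man"]).toarray())
```

Both printed rows of the first matrix are identical, so a model built on these features literally cannot tell the two sentences apart.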
Vectorization is a necessary first step in almost every natural language processing task. It allows us to convert text into a form that machine learning models can understand. While basic methods like Bag of Words and TF-IDF are simple and effective for small projects, they have limitations that are solved by more advanced techniques like Word Embeddings and Transformers.
# Step 1: Import Required Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Load a Larger and Balanced Dataset¶
We load a larger, balanced dataset from reviews.csv: each row contains a short review (text) labeled positive (1) or negative (0).
# Load data from reviews.csv (pandas was already imported in Step 1)
df = pd.read_csv("reviews.csv")
df.head()
|   | text | label |
|---|---|---|
| 0 | Highly recommend | 1 |
| 1 | Worst product I’ve used | 0 |
| 2 | Would not recommend | 0 |
| 3 | Awesome | 1 |
| 4 | Helpful support team | 1 |
Step 3: Train-Test Split¶
We stratify the split to maintain class balance in both training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
df['text'], df['label'], test_size=0.3, stratify=df['label'], random_state=42
)
Step 4: TF-IDF Vectorization¶
Convert text data into numerical features using TF-IDF.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
Step 5: Naïve Bayes Classifier¶
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)
y_pred_nb = nb_model.predict(X_test_tfidf)
print('Naïve Bayes Accuracy:', accuracy_score(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb, zero_division=0))
Naïve Bayes Accuracy: 1.0
precision recall f1-score support
0 1.00 1.00 1.00 144
1 1.00 1.00 1.00 156
accuracy 1.00 300
macro avg 1.00 1.00 1.00 300
weighted avg 1.00 1.00 1.00 300
Step 6: Logistic Regression¶
lr_model = LogisticRegression()
lr_model.fit(X_train_tfidf, y_train)
y_pred_lr = lr_model.predict(X_test_tfidf)
print('Logistic Regression Accuracy:', accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr, zero_division=0))
Logistic Regression Accuracy: 1.0
precision recall f1-score support
0 1.00 1.00 1.00 144
1 1.00 1.00 1.00 156
accuracy 1.00 300
macro avg 1.00 1.00 1.00 300
weighted avg 1.00 1.00 1.00 300
Step 7: Support Vector Machine (SVM)¶
svm_model = LinearSVC()
svm_model.fit(X_train_tfidf, y_train)
y_pred_svm = svm_model.predict(X_test_tfidf)
print('SVM Accuracy:', accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm, zero_division=0))
SVM Accuracy: 1.0
precision recall f1-score support
0 1.00 1.00 1.00 144
1 1.00 1.00 1.00 156
accuracy 1.00 300
macro avg 1.00 1.00 1.00 300
weighted avg 1.00 1.00 1.00 300
Step 8: Confusion Matrix Comparison¶
def plot_cm(y_true, y_pred, title):
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title(title)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
plot_cm(y_test, y_pred_nb, 'Naïve Bayes Confusion Matrix')
plot_cm(y_test, y_pred_lr, 'Logistic Regression Confusion Matrix')
plot_cm(y_test, y_pred_svm, 'SVM Confusion Matrix')