Naïve Bayes Classifier¶

Bayes’ Theorem¶

Bayes' Theorem is defined as:

$$ P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)} $$

Where:

  • $P(C|X)$: Posterior probability of class $C$ given features $X$
  • $P(X|C)$: Likelihood of features given class
  • $P(C)$: Prior probability of class
  • $P(X)$: Evidence (constant across classes)
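To make the formula concrete, here is a small numeric sketch. The probabilities are made up for illustration (a spam-filter flavored example, not derived from this notebook's data):

```python
# Illustrative numbers: 30% of email is spam; the word "free" appears in
# 80% of spam and 10% of non-spam messages.
p_spam = 0.3
p_free_given_spam = 0.8
p_free_given_ham = 0.1

# Evidence P("free"), via the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Posterior P(spam | "free") by Bayes' theorem
posterior = p_free_given_spam * p_spam / p_free
print(round(posterior, 3))  # 0.774
```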

Naïve Assumption¶

Naïve Bayes assumes that all features are conditionally independent given the class, allowing the posterior to be simplified as:

$$ P(C|x_1, x_2, ..., x_n) \propto P(C) \cdot \prod_{i=1}^{n} P(x_i|C) $$

Despite this strong independence assumption, Naïve Bayes often performs well in practice, especially for high-dimensional data like text.
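The product rule above can be sketched directly. The priors and likelihoods below are illustrative numbers, not values learned from data:

```python
from math import prod

# Illustrative priors and per-feature likelihoods (not learned from data)
prior = {'Yes': 0.6, 'No': 0.4}
likelihood = {
    'Yes': {'Outlook=Sunny': 0.4, 'Temperature=Hot': 0.3},
    'No':  {'Outlook=Sunny': 0.6, 'Temperature=Hot': 0.5},
}

# Unnormalized score per class: P(C) * product of P(x_i | C)
scores = {c: prior[c] * prod(likelihood[c].values()) for c in prior}

# Dividing by the evidence P(X) = sum of the scores gives the posterior
total = sum(scores.values())
posterior = {c: round(s / total, 3) for c, s in scores.items()}
print(posterior)  # {'Yes': 0.375, 'No': 0.625}
```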

Types of Naïve Bayes¶

  • MultinomialNB: For discrete features such as word counts
  • BernoulliNB: For binary/boolean features
  • GaussianNB: For continuous features assuming normal distribution
In [62]:
# import

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
import pandas as pd
In [63]:
import pandas as pd

# dataset
data = {
    'Outlook': [
        'Sunny', 'Sunny', 'Overcast', 'Rainy', 'Sunny', 'Overcast', 'Rainy', 'Sunny', 'Sunny', 'Overcast',
        'Rainy', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Sunny', 'Rainy', 'Rainy', 'Sunny', 'Sunny',
        'Overcast', 'Rainy', 'Sunny', 'Sunny', 'Overcast', 'Rainy', 'Sunny', 'Rainy', 'Overcast', 'Sunny',
        'Sunny', 'Overcast', 'Rainy', 'Sunny', 'Sunny', 'Rainy', 'Overcast', 'Sunny', 'Rainy', 'Overcast',
        'Sunny', 'Rainy', 'Overcast', 'Rainy', 'Sunny', 'Overcast', 'Rainy', 'Sunny', 'Rainy', 'Sunny'
    ],
    'Temperature': [
        'Hot', 'Mild', 'Hot', 'Mild', 'Cool', 'Mild', 'Cool', 'Hot', 'Hot', 'Hot',
        'Mild', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Hot', 'Mild', 'Cool', 'Mild',
        'Hot', 'Mild', 'Cool', 'Hot', 'Mild', 'Cool', 'Hot', 'Hot', 'Mild', 'Mild',
        'Hot', 'Cool', 'Mild', 'Cool', 'Hot', 'Hot', 'Mild', 'Mild', 'Cool', 'Cool',
        'Mild', 'Cool', 'Hot', 'Mild', 'Hot', 'Mild', 'Cool', 'Mild', 'Hot', 'Mild'
    ],
    'Play': [
        'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes',
        'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No',
        'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No',
        'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No'
    ]
}

df = pd.DataFrame(data)
df
Out[63]:
Outlook Temperature Play
0 Sunny Hot No
1 Sunny Mild No
2 Overcast Hot Yes
3 Rainy Mild Yes
4 Sunny Cool Yes
5 Overcast Mild Yes
6 Rainy Cool Yes
7 Sunny Hot No
8 Sunny Hot Yes
9 Overcast Hot Yes
10 Rainy Mild Yes
11 Rainy Cool No
12 Sunny Cool Yes
13 Overcast Mild Yes
14 Overcast Cool Yes
15 Sunny Mild No
16 Rainy Hot Yes
17 Rainy Mild No
18 Sunny Cool Yes
19 Sunny Mild No
20 Overcast Hot Yes
21 Rainy Mild No
22 Sunny Cool Yes
23 Sunny Hot No
24 Overcast Mild Yes
25 Rainy Cool No
26 Sunny Hot Yes
27 Rainy Hot No
28 Overcast Mild Yes
29 Sunny Mild No
30 Sunny Hot Yes
31 Overcast Cool Yes
32 Rainy Mild Yes
33 Sunny Cool No
34 Sunny Hot Yes
35 Rainy Hot No
36 Overcast Mild Yes
37 Sunny Mild Yes
38 Rainy Cool No
39 Overcast Cool Yes
40 Sunny Mild No
41 Rainy Cool Yes
42 Overcast Hot Yes
43 Rainy Mild No
44 Sunny Hot Yes
45 Overcast Mild Yes
46 Rainy Cool Yes
47 Sunny Mild No
48 Rainy Hot Yes
49 Sunny Mild No
In [64]:
from sklearn.preprocessing import LabelEncoder

# Encode string labels into numbers
le_outlook = LabelEncoder()
le_temp = LabelEncoder()
le_play = LabelEncoder()

df['Outlook'] = le_outlook.fit_transform(df['Outlook'])       # Sunny=2, Overcast=0, Rainy=1
df['Temperature'] = le_temp.fit_transform(df['Temperature'])  # Cool=0, Hot=1, Mild=2
df['Play'] = le_play.fit_transform(df['Play'])                # No=0, Yes=1

df
Out[64]:
Outlook Temperature Play
0 2 1 0
1 2 2 0
2 0 1 1
3 1 2 1
4 2 0 1
5 0 2 1
6 1 0 1
7 2 1 0
8 2 1 1
9 0 1 1
10 1 2 1
11 1 0 0
12 2 0 1
13 0 2 1
14 0 0 1
15 2 2 0
16 1 1 1
17 1 2 0
18 2 0 1
19 2 2 0
20 0 1 1
21 1 2 0
22 2 0 1
23 2 1 0
24 0 2 1
25 1 0 0
26 2 1 1
27 1 1 0
28 0 2 1
29 2 2 0
30 2 1 1
31 0 0 1
32 1 2 1
33 2 0 0
34 2 1 1
35 1 1 0
36 0 2 1
37 2 2 1
38 1 0 0
39 0 0 1
40 2 2 0
41 1 0 1
42 0 1 1
43 1 2 0
44 2 1 1
45 0 2 1
46 1 0 1
47 2 2 0
48 1 1 1
49 2 2 0
In [65]:
from sklearn.naive_bayes import CategoricalNB

# Define features (X) and label (y)
X = df[['Outlook', 'Temperature']]
y = df['Play']

# Train the model
model = CategoricalNB()
model.fit(X, y)
Out[65]:
CategoricalNB()
In [66]:
# Encode new input: Sunny, Hot
outlook_input = le_outlook.transform(['Sunny'])[0]
temp_input = le_temp.transform(['Hot'])[0]

# Use DataFrame to avoid warnings
input_df = pd.DataFrame([[outlook_input, temp_input]], columns=['Outlook', 'Temperature'])

# Predict
model.predict(input_df)
Out[66]:
array([1])
In [67]:
# Convert back to class label
predicted = model.predict(input_df)
le_play.inverse_transform(predicted)
Out[67]:
array(['Yes'], dtype=object)

What Happened¶

The model learned from the data how likely each value of Outlook and Temperature is for each class (Play = Yes or No), treating the two features as independent given the class.

To predict, it calculates:

  • Prior: how common each class is overall (e.g., the fraction of Yes vs. No)
  • Likelihood: how often each feature value occurs within each class
  • It then multiplies the prior by the likelihoods and picks the class with the highest product
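These steps can be reproduced by hand. The mini-dataset below is made up for illustration; the smoothing follows scikit-learn's default Laplace smoothing (alpha=1):

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB

# Toy encoded dataset (Outlook: 0-2, Temperature: 0-2, Play: 0=No, 1=Yes)
X = pd.DataFrame({'Outlook':     [2, 2, 0, 1, 2, 0],
                  'Temperature': [1, 2, 1, 2, 0, 2]})
y = pd.Series([0, 0, 1, 1, 1, 1], name='Play')

model = CategoricalNB(alpha=1.0)  # alpha=1.0 is Laplace smoothing (the default)
model.fit(X, y)

# Hand computation for the query Outlook=2, Temperature=1:
# score(c) = P(c) * P(Outlook=2 | c) * P(Temperature=1 | c),
# with add-one smoothing over the 3 categories of each feature.
def smoothed(count, class_total, n_categories=3, alpha=1.0):
    return (count + alpha) / (class_total + alpha * n_categories)

# Class 0 has 2 samples: both have Outlook=2, one has Temperature=1
score_0 = (2 / 6) * smoothed(2, 2) * smoothed(1, 2)
# Class 1 has 4 samples: one has Outlook=2, one has Temperature=1
score_1 = (4 / 6) * smoothed(1, 4) * smoothed(1, 4)

query = pd.DataFrame([[2, 1]], columns=['Outlook', 'Temperature'])
print(score_0 / (score_0 + score_1))     # hand-computed posterior for class 0
print(model.predict_proba(query)[0, 0])  # matches the model's posterior
```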
In [68]:
# Create a DataFrame for test samples

test_samples = {
    'Outlook': ['Sunny', 'Overcast', 'Rainy', 'Sunny'],
    'Temperature': ['Mild', 'Mild', 'Mild', 'Mild']
}

test_df = pd.DataFrame(test_samples)
print(test_df)
    Outlook Temperature
0     Sunny        Mild
1  Overcast        Mild
2     Rainy        Mild
3     Sunny        Mild
In [69]:
# Encode using the same label encoders used during training
test_df['Outlook'] = le_outlook.transform(test_df['Outlook'])
test_df['Temperature'] = le_temp.transform(test_df['Temperature'])

print(test_df)
   Outlook  Temperature
0        2            2
1        0            2
2        1            2
3        2            2
In [70]:
# Predict for all test samples
predicted = model.predict(test_df)

# Decode predicted labels (0/1 → No/Yes)
decoded = le_play.inverse_transform(predicted)

# Add predictions to the DataFrame
test_samples_result = test_df.copy()
test_samples_result['Play_Predicted'] = decoded
print(test_samples_result)
   Outlook  Temperature Play_Predicted
0        2            2             No
1        0            2            Yes
2        1            2             No
3        2            2             No

Adult Income Dataset from University of California Website¶

In [71]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Load the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
           'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
           'hours_per_week', 'native_country', 'income']
df = pd.read_csv(url, header=None, names=columns, na_values=' ?')


print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      30718 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  31978 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
None
In [72]:
print(df.describe())
                age        fnlwgt  education_num  capital_gain  capital_loss  \
count  32561.000000  3.256100e+04   32561.000000  32561.000000  32561.000000   
mean      38.581647  1.897784e+05      10.080679   1077.648844     87.303830   
std       13.640433  1.055500e+05       2.572720   7385.292085    402.960219   
min       17.000000  1.228500e+04       1.000000      0.000000      0.000000   
25%       28.000000  1.178270e+05       9.000000      0.000000      0.000000   
50%       37.000000  1.783560e+05      10.000000      0.000000      0.000000   
75%       48.000000  2.370510e+05      12.000000      0.000000      0.000000   
max       90.000000  1.484705e+06      16.000000  99999.000000   4356.000000   

       hours_per_week  
count    32561.000000  
mean        40.437456  
std         12.347429  
min          1.000000  
25%         40.000000  
50%         40.000000  
75%         45.000000  
max         99.000000  
In [73]:
print(df.head())
   age          workclass  fnlwgt   education  education_num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital_status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital_gain  capital_loss  hours_per_week  native_country  income  
0          2174             0              40   United-States   <=50K  
1             0             0              13   United-States   <=50K  
2             0             0              40   United-States   <=50K  
3             0             0              40   United-States   <=50K  
4             0             0              40            Cuba   <=50K  
In [74]:
print(df.isnull().sum())
age                  0
workclass         1836
fnlwgt               0
education            0
education_num        0
marital_status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country     583
income               0
dtype: int64
In [75]:
print(df['income'].unique())
[' <=50K' ' >50K']
In [76]:
# Step 2: Preprocess the data
# Drop rows with missing values
df.dropna(inplace=True)

# Encode categorical variables
# Note: CategoricalNB treats every non-negative integer value as a category
# index, so the numeric columns (e.g., age, fnlwgt) are also handled as
# categorical by the model below.
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoders = {}
for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Define features and target
X = df.drop('income', axis=1)
y = df['income']
In [77]:
# Step 3: Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
In [78]:
# Step 4: Train the Naïve Bayes classifier
model = CategoricalNB()
model.fit(X_train, y_train)
Out[78]:
CategoricalNB()
In [79]:
# Step 5: Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(classification_report(y_test, y_pred))
Accuracy: 0.86
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.93      0.91      6767
           1       0.77      0.65      0.70      2282

    accuracy                           0.86      9049
   macro avg       0.83      0.79      0.81      9049
weighted avg       0.86      0.86      0.86      9049

Accuracy¶

Accuracy is the ratio of correctly predicted observations to the total observations:

Accuracy = (Correct predictions) / (Total predictions)
         = (True Positives + True Negatives) / Total
         = 0.86

This means 86% of all predictions made by the model were correct.

Classification Report¶

Metric      Class 0 (<=50K)   Class 1 (>50K)
Precision   0.89              0.77
Recall      0.93              0.65
F1-score    0.91              0.70
Support     6767              2282

  • Precision is the proportion of correct predictions among all predictions for that class. It answers: Of all the instances predicted as a class, how many were actually correct?

    $$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

  • Recall is the proportion of correct predictions among all actual samples of that class. It answers: Of all the actual instances of a class, how many did the model correctly identify?

    $$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

  • F1-score is the harmonic mean of precision and recall. It provides a single measure of a model’s performance when there is an uneven class distribution.

    $$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

  • Support is the number of actual instances of each class in the dataset. It gives context to the precision, recall, and F1-score by indicating how many samples the score is based on.
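The formulas above (including accuracy) can be checked with a few lines of arithmetic; the confusion-matrix counts below are made up for illustration:

```python
# Made-up confusion-matrix counts for a single positive class
tp, fp, fn, tn = 80, 20, 40, 60

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, round(recall, 3), round(f1, 3))  # 0.7 0.8 0.667 0.727
```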

Class 0 (<=50K): High precision and recall, indicating strong performance on this class.

Class 1 (>50K): Lower recall and F1-score, meaning the model fails to identify a substantial share of the actual >50K instances.

Macro and Weighted Averages¶

Metric      Macro Avg   Weighted Avg
Precision   0.83        0.86
Recall      0.79        0.86
F1-score    0.81        0.86

  • Macro Average: Averages the metric across both classes equally, without considering class imbalance.
  • Weighted Average: Averages the metric across both classes, weighted by the number of instances in each class.

Since class 0 has significantly more samples, the weighted average is closer to the scores for class 0.
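The two averages can be verified directly from the recall values and supports in the classification report:

```python
# Per-class recall and support from the classification report above
recalls = {'<=50K': 0.93, '>50K': 0.65}
support = {'<=50K': 6767, '>50K': 2282}

# Macro average: plain mean, each class counts equally
macro = sum(recalls.values()) / len(recalls)

# Weighted average: mean weighted by each class's share of the samples
total = sum(support.values())
weighted = sum(recalls[c] * support[c] / total for c in recalls)

print(round(macro, 2), round(weighted, 2))  # 0.79 0.86, as in the report
```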

Summary¶

  • The model performs very well on class 0 (income <= 50K).
  • It performs worse on class 1 (income > 50K), especially in terms of recall.
  • The overall accuracy is high, but the model is biased toward the majority class.

Types of Naïve Bayes Models¶

Naïve Bayes classifiers are based on Bayes’ Theorem and assume that features are conditionally independent given the class. The choice of model depends on the nature of the input features.

Multinomial Naïve Bayes¶

Used when features represent counts or frequencies.

Example: Text classification where each feature is the count of a word in the document (e.g., spam detection using CountVectorizer).

Bernoulli Naïve Bayes¶

Used when features are binary (e.g., presence or absence of a word).

Example: Email classification using binary word occurrence features (word present = 1, absent = 0).

Gaussian Naïve Bayes¶

Used when features are continuous and normally distributed.

Example: Classifying patients as having a disease or not based on continuous variables like blood pressure, cholesterol level, and age.

Each model suits a different type of data: count data, binary indicators, or continuous measurements.
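As a minimal sketch of the continuous case, GaussianNB fits a per-class normal distribution to each feature. The synthetic two-class data below is generated purely for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic continuous data: two classes drawn from normal distributions
# with different means (class 0 around 0, class 1 around 3)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

gnb = GaussianNB()
gnb.fit(X, y)

# Points at the two class centers are classified accordingly
print(gnb.predict([[0.0, 0.0], [3.0, 3.0]]))
```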